[tripleo][ci] container pulls failing
Bogdan Dobrelya
bdobreli at redhat.com
Wed Aug 19 14:34:34 UTC 2020
On 8/19/20 4:31 PM, Bogdan Dobrelya wrote:
> On 8/19/20 3:55 PM, Alex Schultz wrote:
>> On Wed, Aug 19, 2020 at 7:53 AM Bogdan Dobrelya <bdobreli at redhat.com>
>> wrote:
>>>
>>> On 8/19/20 3:23 PM, Alex Schultz wrote:
>>>> On Wed, Aug 19, 2020 at 7:15 AM Luke Short <ekultails at gmail.com> wrote:
>>>>>
>>>>> Hey folks,
>>>>>
>>>>> All of the latest patches to address this have been merged, but we
>>>>> are still seeing this error randomly in CI jobs that involve an
>>>>> Undercloud or Standalone node. As far as I can tell, the error is
>>>>> appearing less often than before, but it is still present, which makes
>>>>> merging new patches difficult. I would be happy to help work towards
>>>>> other possible solutions; however, I am unsure where to start from
>>>>> here. Any help would be greatly appreciated.
>>>>>
>>>>
>>>> I'm looking at this today, but from what I can tell the problem is
>>>> likely caused by a reduced anonymous query quota on docker.io combined
>>>> with our usage of the upstream mirrors. Because the mirrors essentially
>>>> funnel all requests through a single IP, we're hitting the limits
>>>> faster than if we didn't use the mirrors. Due to the authorization
>>>> header on these requests, the metadata queries don't get cached, yet
>>>> they are still subject to the rate limiting. Additionally, we're
>>>> querying the registry to determine which containers we need to update
>>>> in CI, because we limit our updates to a certain set of containers as
>>>> part of the CI jobs.
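
For illustration, here is a minimal sketch of what such a metadata query
looks like, assuming the standard registry v2 token flow (the repository
name below is just an example, not what our jobs actually pull):

    # Sketch only: even an "anonymous" pull first fetches a short-lived
    # bearer token, and every manifest (metadata) query then carries it
    # in an Authorization header. A caching proxy will not cache such a
    # request, while docker.io still counts it against the rate limit.
    import requests

    IMAGE = "library/centos"  # example repository, for illustration only
    TAG = "8"

    token = requests.get(
        "https://auth.docker.io/token",
        params={"service": "registry.docker.io",
                "scope": "repository:%s:pull" % IMAGE},
    ).json()["token"]

    manifest = requests.get(
        "https://registry-1.docker.io/v2/%s/manifests/%s" % (IMAGE, TAG),
        headers={
            "Authorization": "Bearer " + token,
            "Accept": "application/vnd.docker.distribution.manifest.v2+json",
        },
    )
    print(manifest.status_code, manifest.headers.get("docker-content-digest"))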
>>>>
>>>> So there are likely a few different steps forward on this and we can
>>>> do a few of these together.
>>>>
>>>> 1) stop using mirrors (not ideal, but likely makes this go away).
>>>> Alternatively, switch stable branches off the mirrors, since they have
>>>> a reduced number of job executions, and leave the mirrors configured
>>>> on master only (or vice versa).
>>>
>>> Also, the stable/(N-1) branch could use quay.io, while master keeps
>>> using docker.io (assuming containers for that N-1 release will be
>>> hosted there instead of on Docker Hub).
>>>
>>
>> quay has its own limits and likely will suffer from a similar problem.
>
> Right. But lowering the total number of requests sent to each registry
> could mean that we trigger the rate limiting of either of the two less
> often.
>
>>
>>>> 2) reduce the number of jobs
>>>> 3) stop querying the registry for the update filters (I'm looking into
>>>> this today) and use the information in tripleo-common first.
>>>> 4) always build containers instead of fetching them from docker.io
>
> There may be a middle-ground solution: building the containers only once
> for each patchset executed in the TripleO Zuul pipelines. Transient
> images that can have a TTL and self-expire, like [0], should be used for
> that purpose.
>
> [0]
> https://idbs-engineering.com/containers/2019/08/27/auto-expiry-quayio-tags.html
>
>
> That would require the Zuul jobs with dependencies to pass Ansible
> variables to each other, based on the execution results. Can that be
> done?
...or, even simpler than that, predictable names could be created for
those transient images, like <namespace>/<tag>_<patchset>.
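
A rough sketch of that idea (the registry host and namespace below are
hypothetical, just to show the shape of it): the build job derives the
transient tag from the Zuul change/patchset, and dependent jobs can probe
the registry for it before deciding what to pull or build:

    # Sketch only: derive a predictable transient tag from the Zuul
    # change/patchset and check a (hypothetical) quay registry for it
    # via the registry v2 API. Registry host and namespace are made up.
    import os
    import requests

    REGISTRY = "quay.example.org"     # hypothetical TripleO-hosted registry
    NAMESPACE = "tripleo-transient"   # hypothetical namespace

    def transient_tag(base_tag="current-tripleo"):
        # assumes the change and patchset numbers are exported into the
        # environment; change_patchset keeps the tag unique per revision
        change = os.environ.get("ZUUL_CHANGE", "0")
        patchset = os.environ.get("ZUUL_PATCHSET", "0")
        return "%s_%s_%s" % (base_tag, change, patchset)

    def image_exists(name, tag):
        # a HEAD on the manifest tells us whether the transient image for
        # this patchset has already been built and pushed
        url = "https://%s/v2/%s/%s/manifests/%s" % (
            REGISTRY, NAMESPACE, name, tag)
        resp = requests.head(url, headers={
            "Accept": "application/vnd.docker.distribution.manifest.v2+json"})
        return resp.status_code == 200

    print(image_exists("nova-api", transient_tag()))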
>
> Pretty much like we already have it set up in TripleO, where the tox
> jobs are a dependency for the standalone/multinode jobs, but with an
> extra step to prepare such a transient pack of container images (only to
> be used for that patchset) and push it to a quay registry hosted
> elsewhere by the TripleO devops folks.
>
> Then the jobs that have that dependency met can use those transient
> images via an Ansible variable passed to the jobs. Auto-expiration
> solves the space/lifecycle requirements for the cloud that will be
> hosting that registry.
>
>>>>
>>>> Thanks,
>>>> -Alex
>>>>
>>>>
>>>>
>>>>> Sincerely,
>>>>> Luke Short
>>>>>
>>>>> On Wed, Aug 5, 2020 at 12:26 PM Wesley Hayutin
>>>>> <whayutin at redhat.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 29, 2020 at 4:48 PM Wesley Hayutin
>>>>>> <whayutin at redhat.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jul 29, 2020 at 4:33 PM Alex Schultz
>>>>>>> <aschultz at redhat.com> wrote:
>>>>>>>>
>>>>>>>> On Wed, Jul 29, 2020 at 7:13 AM Wesley Hayutin
>>>>>>>> <whayutin at redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 29, 2020 at 2:25 AM Bogdan Dobrelya
>>>>>>>>> <bdobreli at redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 7/28/20 6:09 PM, Wesley Hayutin wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi
>>>>>>>>>>> <emilien at redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz
>>>>>>>>>>> <aschultz at redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi
>>>>>>>>>>> <emilien at redhat.com> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin
>>>>>>>>>>> > <whayutin at redhat.com> wrote:
>>>>>>>>>>> >>
>>>>>>>>>>> >> FYI...
>>>>>>>>>>> >>
>>>>>>>>>>> >> If you find your jobs are failing with an error similar to
>>>>>>>>>>> >> [1], you have been rate limited by docker.io via the upstream
>>>>>>>>>>> >> mirror system and have hit [2]. I've been discussing the issue
>>>>>>>>>>> >> w/ upstream infra, rdo-infra and a few CI engineers.
>>>>>>>>>>> >>
>>>>>>>>>>> >> There are a few ways to mitigate the issue, however I don't
>>>>>>>>>>> >> see any of the options being completed very quickly, so I'm
>>>>>>>>>>> >> asking for your patience while this issue is socialized and
>>>>>>>>>>> >> resolved.
>>>>>>>>>>> >>
>>>>>>>>>>> >> For full transparency we're considering the following options.
>>>>>>>>>>> >>
>>>>>>>>>>> >> 1. move off of docker.io to quay.io
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > quay.io also has an API rate limit:
>>>>>>>>>>> > https://docs.quay.io/issues/429.html
>>>>>>>>>>> >
>>>>>>>>>>> > Now I'm not sure how many requests per second one can do vs the
>>>>>>>>>>> > other, but this would need to be checked with the quay team
>>>>>>>>>>> > before changing anything.
>>>>>>>>>>> > Also, quay.io has had its big downtimes as well, so the SLA
>>>>>>>>>>> > needs to be considered.
>>>>>>>>>>> >
>>>>>>>>>>> >> 2. local container builds for each job in master, possibly
>>>>>>>>>>> >> ussuri
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > Not convinced.
>>>>>>>>>>> > You can look at CI logs:
>>>>>>>>>>> > - pulling / updating / pushing container images from docker.io
>>>>>>>>>>> >   to local registry takes ~10 min on standalone (OVH)
>>>>>>>>>>> > - building containers from scratch with updated repos and
>>>>>>>>>>> >   pushing them to local registry takes ~29 min on standalone
>>>>>>>>>>> >   (OVH).
>>>>>>>>>>> >
>>>>>>>>>>> >>
>>>>>>>>>>> >> 3. parent/child jobs upstream, where rpms and containers will
>>>>>>>>>>> >> be built, and artifacts hosted, for the child jobs
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > Yes, we need to investigate that.
>>>>>>>>>>> >
>>>>>>>>>>> >>
>>>>>>>>>>> >> 4. remove some portion of the upstream jobs to lower the
>>>>>>>>>>> >> impact we have on 3rd party infrastructure.
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > I'm not sure I understand this one, maybe you can give an
>>>>>>>>>>> > example of what could be removed?
>>>>>>>>>>>
>>>>>>>>>>> We need to re-evaluate our use of scenarios (e.g. we have two
>>>>>>>>>>> scenario010s, both of which are non-voting). There's a reason we
>>>>>>>>>>> historically didn't want to add more jobs: these types of
>>>>>>>>>>> resource constraints. I think we've added new jobs recently and
>>>>>>>>>>> likely need to reduce what we run. Additionally, we might want to
>>>>>>>>>>> look into reducing what we run on stable branches as well.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Oh... removing jobs (I thought we would remove some steps of the
>>>>>>>>>>> jobs). Yes, big +1, this should be a continuous goal when working
>>>>>>>>>>> on CI, always evaluating what we need vs what we run now.
>>>>>>>>>>>
>>>>>>>>>>> We should look at:
>>>>>>>>>>> 1) services deployed in scenarios that aren't worth testing (e.g.
>>>>>>>>>>> deprecated or unused things) (and deprecate the unused things)
>>>>>>>>>>> 2) jobs themselves (I don't have any example besides scenario010,
>>>>>>>>>>> but I'm sure there are more).
>>>>>>>>>>> --
>>>>>>>>>>> Emilien Macchi
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks Alex, Emilien
>>>>>>>>>>>
>>>>>>>>>>> +1 to reviewing the catalog and adjusting things on an
>>>>>>>>>>> ongoing basis.
>>>>>>>>>>>
>>>>>>>>>>> All.. it looks like the issues with docker.io were more of a
>>>>>>>>>>> flare-up than a change in docker.io policy or infrastructure [2].
>>>>>>>>>>> The flare-up started on July 27 at 08:00 UTC and ended on July 27
>>>>>>>>>>> at 17:00 UTC, see screenshots.
>>>>>>>>>>
>>>>>>>>>> The number of image prepare workers and their exponential
>>>>>>>>>> fallback intervals should also be adjusted. I've analysed the log
>>>>>>>>>> snippet [0] for the connection reset counts by workers versus the
>>>>>>>>>> times the rate limiting was triggered. See the details in the
>>>>>>>>>> reported bug [1].
>>>>>>>>>>
>>>>>>>>>> tl;dr -- for an example 5 sec interval, 03:55:31,379 - 03:55:36,110:
>>>>>>>>>>
>>>>>>>>>> Conn Reset Counts by a Worker PID:
>>>>>>>>>> 3 58412
>>>>>>>>>> 2 58413
>>>>>>>>>> 3 58415
>>>>>>>>>> 3 58417
>>>>>>>>>>
>>>>>>>>>> which seems like too many (workers * reconnects) and triggers the
>>>>>>>>>> rate limiting immediately.
>>>>>>>>>>
>>>>>>>>>> [0]
>>>>>>>>>> https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcdn.com/741228/6/check/tripleo-ci-centos-8-undercloud-containers/8e47836/logs/undercloud/var/log/tripleo-container-image-prepare.log
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1] https://bugs.launchpad.net/tripleo/+bug/1889372
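
(To illustrate the point about the fallback intervals: a per-worker
exponential back-off with jitter, roughly like the sketch below, would
spread the reconnects out instead of having all workers retry in
lockstep. This is not the actual image prepare code, just the idea.)

    # Sketch only, not the tripleo-common implementation: retry a fetch
    # with exponential back-off and full jitter so that concurrent
    # workers do not hammer the registry at the same moments.
    import random
    import time

    def fetch_with_backoff(fetch, retries=5, base=1.0, cap=30.0):
        for attempt in range(retries):
            try:
                return fetch()
            except OSError:  # e.g. connection reset by the registry/proxy
                if attempt == retries - 1:
                    raise
                delay = min(cap, base * (2 ** attempt))
                # full jitter: sleep anywhere in [0, delay)
                time.sleep(random.uniform(0, delay))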
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Best regards,
>>>>>>>>>> Bogdan Dobrelya,
>>>>>>>>>> Irc #bogdando
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> FYI..
>>>>>>>>>
>>>>>>>>> The issue w/ "too many requests" is back. Expect delays and
>>>>>>>>> failures in attempting to merge your patches upstream across
>>>>>>>>> all branches. The issue is being tracked as a critical issue.
>>>>>>>>
>>>>>>>> Working with the infra folks, we have identified the authorization
>>>>>>>> header as causing issues when we're redirected from docker.io to
>>>>>>>> Cloudflare. I'll throw up a patch tomorrow to handle this case,
>>>>>>>> which should improve our usage of the cache. It needs some testing
>>>>>>>> against other registries to ensure that we don't break
>>>>>>>> authenticated fetching of resources.
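
(For context, the gist of that workaround, as I understand it, looks
roughly like the sketch below -- not Alex's actual patch: follow the
registry's redirect manually and drop the Authorization header once the
request leaves the registry host, so the caching proxy can actually cache
the response.)

    # Sketch only: fetch a blob, following the redirect ourselves and
    # keeping the auth header only while we talk to the registry itself.
    from urllib.parse import urlparse

    import requests

    def fetch_blob(url, token):
        session = requests.Session()
        resp = session.get(url,
                           headers={"Authorization": "Bearer " + token},
                           allow_redirects=False)
        if resp.status_code in (301, 302, 303, 307, 308):
            redirect = resp.headers["Location"]
            headers = {}
            if urlparse(redirect).netloc == urlparse(url).netloc:
                # same host: keep authenticating
                headers["Authorization"] = "Bearer " + token
            # different host (e.g. the CDN): send no auth header, so the
            # mirror/cache in front of us can actually cache the object
            resp = session.get(redirect, headers=headers)
        resp.raise_for_status()
        return resp.content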
>>>>>>>>
>>>>>>> Thanks Alex!
>>>>>>
>>>>>>
>>>>>>
>>>>>> FYI.. we have been revisited by the container pull issue, "too
>>>>>> many requests".
>>>>>> Alex has some fresh patches on it:
>>>>>> https://review.opendev.org/#/q/status:open+project:openstack/tripleo-common+topic:bug/1889122
>>>>>>
>>>>>>
>>>>>> expect trouble in check and gate:
>>>>>> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22429%20Client%20Error%3A%20Too%20Many%20Requests%20for%20url%3A%5C%22%20AND%20voting%3A1
>>>>>>
>>>>>>
>>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Bogdan Dobrelya,
>>> Irc #bogdando
>>>
>>
>
>
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando