Re: [tripleo][ci] container pulls failing

19 Aug 2020


      On 8/19/20 4:31 PM, Bogdan Dobrelya wrote:
...
On 8/19/20 3:55 PM, Alex Schultz wrote:
...
On Wed, Aug 19, 2020 at 7:53 AM Bogdan Dobrelya <bdobreli@redhat.com> 
wrote:
...
On 8/19/20 3:23 PM, Alex Schultz wrote:
...
On Wed, Aug 19, 2020 at 7:15 AM Luke Short <ekultails@gmail.com> wrote:
...
Hey folks,
All of the latest patches to address this have been merged in, but
we are still seeing this error randomly in CI jobs that involve an
Undercloud or Standalone node. As far as I can tell, the error is
appearing less often than before, but it is still present, which makes
merging new patches difficult. I would be happy to help work
towards other possible solutions; however, I am unsure where to start
from here. Any help would be greatly appreciated.
I'm looking at this today but from what I can tell the problem is
likely caused by a reduced anonymous query quota from docker.io and
our usage of the upstream mirrors.  Because the mirrors essentially
funnel all requests through a single IP we're hitting limits faster
than if we didn't use the mirrors. Because the metadata queries carry an
authorization header, they don't get cached but are still subject to
the rate limiting.  Additionally we're querying the
registry to determine which containers we need to update in CI because
we limit our updates to a certain set of containers as part of the CI
jobs.
So there are likely a few different steps forward on this and we can
do a few of these together.
1) stop using mirrors (not ideal but likely makes this go away).
Alternatively switch stable branches off the mirrors due to a reduced
number of executions and leave mirrors configured on master only (or
vice versa).
Also, the stable/(N-1) branch could use quay.io, while master keeps
using docker.io (assuming containers for that N-1 release will be hosted
there instead of the dockerhub)
quay has its own limits and likely will suffer from a similar problem.
Right. But with fewer total requests sent to each registry, we could
end up being rate limited less often by either of the two.
...
...
...
2) reduce the number of jobs
3) stop querying the registry for the update filters (I'm looking into
this today) and use the information in tripleo-common first.
4) build containers always instead of fetching from docker.io
There may be a middle-ground solution: building them only once for each
patchset executed in TripleO Zuul pipelines. Transient images, like [0],
that can have a TTL and self-expire should be used for that purpose.
[0] 
https://idbs-engineering.com/containers/2019/08/27/auto-expiry-quayio-tags.h...
That would require Zuul jobs with dependencies to pass Ansible
variables to each other, based on their execution results. Can that be done?
...or even simpler than that, predictable names can be created for those 
transient images, like <namespace>/<tag>_<patchset>
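
A minimal sketch of such a predictable name, assuming the pattern is
taken literally as <namespace>/<tag>_<patchset>. The function name, the
character whitelist and the example values are hypothetical and are not
part of any TripleO tooling:

```python
import re

# Conservative whitelist for each name component (an assumption for this sketch).
_SAFE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def transient_image_name(namespace: str, tag: str, patchset: str) -> str:
    """Build a predictable name following <namespace>/<tag>_<patchset>."""
    for label, value in (("namespace", namespace), ("tag", tag), ("patchset", patchset)):
        if not _SAFE.fullmatch(value):
            raise ValueError(f"unsafe {label}: {value!r}")
    return f"{namespace}/{tag}_{patchset}"
```

Because the name depends only on values every job in the patchset
already knows, a dependent job could compute it on its own instead of
receiving it from the parent job.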
...
Pretty much like what we already have set up in TripleO, where tox jobs
are a dependency for standalone/multinode jobs, but with an extra step
that prepares such a transient pack of container images (only to be used
for that patchset) and pushes it to a quay registry hosted elsewhere by
TripleO devops folks.
Then the jobs that have that dependency met can use those transient 
images via an ansible variable passed for the jobs. Auto expiration 
solves the space/lifecycle requirements for the cloud that will be 
hosting that registry.
...
...
...
Thanks,
-Alex
...
Sincerely,
      Luke Short
On Wed, Aug 5, 2020 at 12:26 PM Wesley Hayutin 
<whayutin@redhat.com> wrote:
...
On Wed, Jul 29, 2020 at 4:48 PM Wesley Hayutin 
<whayutin@redhat.com> wrote:
>
>
>
> On Wed, Jul 29, 2020 at 4:33 PM Alex Schultz 
> <aschultz@redhat.com> wrote:
>>
>> On Wed, Jul 29, 2020 at 7:13 AM Wesley Hayutin 
>> <whayutin@redhat.com> wrote:
>>>
>>>
>>>
>>> On Wed, Jul 29, 2020 at 2:25 AM Bogdan Dobrelya 
>>> <bdobreli@redhat.com> wrote:
>>>>
>>>> On 7/28/20 6:09 PM, Wesley Hayutin wrote:
>>>>>
>>>>>
>>>>> On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi 
>>>>> <emilien@redhat.com
>>>>> <mailto:emilien@redhat.com>> wrote:
>>>>>
>>>>>
>>>>>
>>>>>       On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz 
>>>>> <aschultz@redhat.com
>>>>>       <mailto:aschultz@redhat.com>> wrote:
>>>>>
>>>>>           On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi
>>>>>           <emilien@redhat.com <mailto:emilien@redhat.com>> 
>>>>> wrote:
>>>>>            >
>>>>>            >
>>>>>            >
>>>>>            > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin
>>>>>           <whayutin@redhat.com <mailto:whayutin@redhat.com>> 
>>>>> wrote:
>>>>>            >>
>>>>>            >> FYI...
>>>>>            >>
>>>>>            >> If you find your jobs are failing with an error 
>>>>> similar to
>>>>>           [1], you have been rate limited by docker.io 
>>>>> <http://docker.io>
>>>>>           via the upstream mirror system and have hit [2].  
>>>>> I've been
>>>>>           discussing the issue w/ upstream infra, rdo-infra 
>>>>> and a few CI
>>>>>           engineers.
>>>>>            >>
>>>>>            >> There are a few ways to mitigate the issue 
>>>>> however I don't
>>>>>           see any of the options being completed very quickly 
>>>>> so I'm
>>>>>           asking for your patience while this issue is 
>>>>> socialized and
>>>>>           resolved.
>>>>>            >>
>>>>>            >> For full transparency we're considering the 
>>>>> following options.
>>>>>            >>
>>>>>            >> 1. move off of docker.io <http://docker.io> to 
>>>>> quay.io
>>>>>           <http://quay.io>
>>>>>            >
>>>>>            >
>>>>>            > quay.io <http://quay.io> also has API rate limit:
>>>>>            > https://docs.quay.io/issues/429.html
>>>>>            >
>>>>>            > Now I'm not sure about how many requests per 
>>>>> seconds one can
>>>>>           do vs the other but this would need to be checked 
>>>>> with the quay
>>>>>           team before changing anything.
>>>>>            > Also quay.io <http://quay.io> had its big 
>>>>> downtimes as well,
>>>>>           SLA needs to be considered.
>>>>>            >
>>>>>            >> 2. local container builds for each job in 
>>>>> master, possibly
>>>>>           ussuri
>>>>>            >
>>>>>            >
>>>>>            > Not convinced.
>>>>>            > You can look at CI logs:
>>>>>            > - pulling / updating / pushing container images 
>>>>> from
>>>>>           docker.io <http://docker.io> to local registry 
>>>>> takes ~10 min on
>>>>>           standalone (OVH)
>>>>>            > - building containers from scratch with updated 
>>>>> repos and
>>>>>           pushing them to local registry takes ~29 min on 
>>>>> standalone (OVH).
>>>>>            >
>>>>>            >>
>>>>>            >> 3. parent child jobs upstream where rpms and 
>>>>> containers will
>>>>>           be built and hosted as artifacts for the child jobs
>>>>>            >
>>>>>            >
>>>>>            > Yes, we need to investigate that.
>>>>>            >
>>>>>            >>
>>>>>            >> 4. remove some portion of the upstream jobs to 
>>>>> lower the
>>>>>           impact we have on 3rd party infrastructure.
>>>>>            >
>>>>>            >
>>>>>            > I'm not sure I understand this one, maybe you 
>>>>> can give an
>>>>>           example of what could be removed?
>>>>>
>>>>>           We need to re-evaluate our use of scenarios (e.g. 
>>>>> we have two
>>>>>           scenario010's both are non-voting).  There's a 
>>>>> reason we
>>>>>           historically
>>>>>           didn't want to add more jobs because of these types 
>>>>> of resource
>>>>>           constraints.  I think we've added new jobs recently 
>>>>> and likely
>>>>>           need to
>>>>>           reduce what we run. Additionally we might want to 
>>>>> look into reducing
>>>>>           what we run on stable branches as well.
>>>>>
>>>>>
>>>>>       Oh... removing jobs (I thought we would remove some 
>>>>> steps of the jobs).
>>>>>       Yes big +1, this should be a continuous goal when 
>>>>> working on CI, and
>>>>>       always evaluating what we need vs what we run now.
>>>>>
>>>>>       We should look at:
>>>>>       1) services deployed in scenarios that aren't worth 
>>>>> testing (e.g.
>>>>>       deprecated or unused things) (and deprecate the unused 
>>>>> things)
>>>>>       2) jobs themselves (I don't have any example beside 
>>>>> scenario010 but
>>>>>       I'm sure there are more).
>>>>>       --
>>>>>       Emilien Macchi
>>>>>
>>>>>
>>>>> Thanks Alex, Emilien
>>>>>
>>>>> +1 to reviewing the catalog and adjusting things on an 
>>>>> ongoing basis.
>>>>>
>>>>> All.. it looks like the issues with docker.io 
>>>>> <http://docker.io> were
>>>>> more of a flare up than a change in docker.io 
>>>>> <http://docker.io> policy
>>>>> or infrastructure [2].  The flare up started on July 27 8am 
>>>>> utc and
>>>>> ended on July 27 17:00 utc, see screenshots.
>>>>
>>>> The number of image prepare workers and their exponential backoff
>>>> intervals should also be adjusted. I've analysed the log 
>>>> snippet [0] for
>>>> the connection reset counts by workers versus the times the rate
>>>> limiting was triggered. See the details in the reported bug [1].
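
For illustration only, a minimal sketch of exponentially growing retry
intervals with random jitter, so that several workers do not reconnect
at the same moment. This is not the tripleo-common implementation; the
function name and the base/cap defaults are assumptions:

```python
import random


def backoff_delays(retries, base=1.0, cap=60.0, rng=None):
    """Return one randomized delay (seconds) per retry attempt.

    The upper bound doubles with each attempt and is clamped at `cap`;
    the actual delay is drawn uniformly from [0, bound] ("full jitter").
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(retries):
        bound = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, bound))
    return delays
```

With fewer workers and a larger base, the product of workers and
reconnects inside any short window drops, which is the quantity the
analysis in this message is concerned with.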
>>>>
>>>> tl;dr -- for an example 5 sec interval 03:55:31,379 - 
>>>> 03:55:36,110:
>>>>
>>>> Conn Reset Counts by a Worker PID:
>>>>          3 58412
>>>>          2 58413
>>>>          3 58415
>>>>          3 58417
>>>>
>>>> which seems like too many (workers*reconnects) and triggers rate
>>>> limiting immediately.
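
A tally like the one above could be produced with a sketch along these
lines. The log line layout assumed here (date, HH:MM:SS,mmm timestamp,
worker PID, then a message containing "Connection reset") is a guess
for illustration; the real log format may differ and the regular
expression would need adjusting:

```python
import re
from collections import Counter

# Assumed layout: "<date> <HH:MM:SS,mmm> <pid> ... Connection reset ..."
LINE_RE = re.compile(
    r"^\S+ (?P<time>\d{2}:\d{2}:\d{2},\d{3}) (?P<pid>\d+) .*Connection reset"
)


def resets_by_pid(lines, start, end):
    """Count connection resets per worker PID within [start, end].

    `start` and `end` are "HH:MM:SS,mmm" strings from the same day, so
    plain string comparison orders them correctly.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and start <= match.group("time") <= end:
            counts[match.group("pid")] += 1
    return counts
```

Called with start="03:55:31,379" and end="03:55:36,110", it returns a
Counter keyed by PID for that 5 sec interval.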
>>>>
>>>> [0]
>>>> https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcd... 
>>>>
>>>>
>>>> [1] https://bugs.launchpad.net/tripleo/+bug/1889372
>>>>
>>>> -- 
>>>> Best regards,
>>>> Bogdan Dobrelya,
>>>> Irc #bogdando
>>>>
>>>
>>> FYI..
>>>
>>> The issue w/ "too many requests" is back.  Expect delays and 
>>> failures in attempting to merge your patches upstream across 
>>> all branches.   The issue is being tracked as a critical issue.
>>
>> Working with the infra folks, we have identified the authorization
>> header as causing issues when we're redirected from docker.io to
>> cloudflare. I'll throw up a patch tomorrow to handle this case 
>> which
>> should improve our usage of the cache.  It needs some testing 
>> against
>> other registries to ensure that we don't break authenticated 
>> fetching
>> of resources.
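
To illustrate the idea (this is not the actual patch), a minimal sketch
that drops the Authorization header when a redirect points at a
different host, so that the redirected request can be served from a
cache. The function name and the same-host rule are assumptions; other
registries may expect the header to be kept, which is why the testing
mentioned above matters:

```python
from urllib.parse import urlsplit


def headers_for_redirect(original_url, redirect_url, headers):
    """Return the headers to send when following a redirect.

    The Authorization header is kept only when the redirect stays on
    the same host; every other header is passed through unchanged.
    """
    same_host = urlsplit(original_url).hostname == urlsplit(redirect_url).hostname
    if same_host:
        return dict(headers)
    return {k: v for k, v in headers.items() if k.lower() != "authorization"}
```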
>>
> Thanks Alex!
FYI.. we have been revisited by the container pull issue, "too 
many requests".
Alex has some fresh patches on it: 
https://review.opendev.org/#/q/status:open+project:openstack/tripleo-common+...
expect trouble in check and gate:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A...
-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando