[tripleo][ci] container pulls failing
FYI...
If you find your jobs are failing with an error similar to [1], you have been rate limited by docker.io via the upstream mirror system and have hit [2]. I've been discussing the issue w/ upstream infra, rdo-infra and a few CI engineers.
There are a few ways to mitigate the issue, however I don't see any of the options being completed very quickly, so I'm asking for your patience while this issue is socialized and resolved.
For full transparency we're considering the following options:
1. move off of docker.io to quay.io
2. local container builds for each job in master, possibly ussuri
3. parent child jobs upstream where rpms and containers will be built and artifacts hosted for the child jobs
4. remove some portion of the upstream jobs to lower the impact we have on 3rd party infrastructure.
If you have thoughts please don't hesitate to share on this thread. Very sorry we're hitting these failures and I really appreciate your patience. I would expect major delays in getting patches merged at this point until things are resolved.
Thank you!
[1] HTTPError: 429 Client Error: Too Many Requests for url: http://mirror.ca-ymq-1.vexxhost.opendev.org:8082/v2/tripleotrain/centos-bina...
[2] https://bugs.launchpad.net/tripleo/+bug/1889122
On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin@redhat.com> wrote:
FYI...
If you find your jobs are failing with an error similar to [1], you have been rate limited by docker.io via the upstream mirror system and have hit [2]. I've been discussing the issue w/ upstream infra, rdo-infra and a few CI engineers.
There are a few ways to mitigate the issue however I don't see any of the options being completed very quickly so I'm asking for your patience while this issue is socialized and resolved.
For full transparency we're considering the following options.
1. move off of docker.io to quay.io
quay.io also has an API rate limit: https://docs.quay.io/issues/429.html Now I'm not sure how many requests per second one can do vs the other, but this would need to be checked with the quay team before changing anything. Also quay.io has had its big downtimes as well, so the SLA needs to be considered.
2. local container builds for each job in master, possibly ussuri
Not convinced. You can look at CI logs:
- pulling / updating / pushing container images from docker.io to the local registry takes ~10 min on standalone (OVH)
- building containers from scratch with updated repos and pushing them to the local registry takes ~29 min on standalone (OVH).
3. parent child jobs upstream where rpms and containers will be built and artifacts hosted for the child jobs
Yes, we need to investigate that.
4. remove some portion of the upstream jobs to lower the impact we have on 3rd party infrastructure.
I'm not sure I understand this one, maybe you can give an example of what could be removed?
If you have thoughts please don't hesitate to share on this thread. Very sorry we're hitting these failures and I really appreciate your patience. I would expect major delays in getting patches merged at this point until things are resolved.
Thank you!
[1] HTTPError: 429 Client Error: Too Many Requests for url: http://mirror.ca-ymq-1.vexxhost.opendev.org:8082/v2/tripleotrain/centos-bina... [2] https://bugs.launchpad.net/tripleo/+bug/1889122
-- Emilien Macchi
On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi <emilien@redhat.com> wrote:
On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin@redhat.com> wrote:
FYI...
If you find your jobs are failing with an error similar to [1], you have been rate limited by docker.io via the upstream mirror system and have hit [2]. I've been discussing the issue w/ upstream infra, rdo-infra and a few CI engineers.
There are a few ways to mitigate the issue however I don't see any of the options being completed very quickly so I'm asking for your patience while this issue is socialized and resolved.
For full transparency we're considering the following options.
1. move off of docker.io to quay.io
quay.io also has API rate limit: https://docs.quay.io/issues/429.html
Now I'm not sure about how many requests per seconds one can do vs the other but this would need to be checked with the quay team before changing anything. Also quay.io had its big downtimes as well, SLA needs to be considered.
2. local container builds for each job in master, possibly ussuri
Not convinced. You can look at CI logs: - pulling / updating / pushing container images from docker.io to local registry takes ~10 min on standalone (OVH) - building containers from scratch with updated repos and pushing them to local registry takes ~29 min on standalone (OVH).
3. parent child jobs upstream where rpms and containers will be build and host artifacts for the child jobs
Yes, we need to investigate that.
4. remove some portion of the upstream jobs to lower the impact we have on 3rd party infrastructure.
I'm not sure I understand this one, maybe you can give an example of what could be removed?
We need to re-evaluate our use of scenarios (e.g. we have two scenario010s, both non-voting). There's a reason we historically didn't want to add more jobs because of these types of resource constraints. I think we've added new jobs recently and likely need to reduce what we run. Additionally we might want to look into reducing what we run on stable branches as well.
If you have thoughts please don't hesitate to share on this thread. Very sorry we're hitting these failures and I really appreciate your patience. I would expect major delays in getting patches merged at this point until things are resolved.
Thank you!
[1] HTTPError: 429 Client Error: Too Many Requests for url: http://mirror.ca-ymq-1.vexxhost.opendev.org:8082/v2/tripleotrain/centos-bina... [2] https://bugs.launchpad.net/tripleo/+bug/1889122
-- Emilien Macchi
On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <aschultz@redhat.com> wrote:
On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi <emilien@redhat.com> wrote:
On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin@redhat.com> wrote:
FYI...
If you find your jobs are failing with an error similar to [1], you
have been rate limited by docker.io via the upstream mirror system and have hit [2]. I've been discussing the issue w/ upstream infra, rdo-infra and a few CI engineers.
There are a few ways to mitigate the issue however I don't see any of the options being completed very quickly so I'm asking for your patience while this issue is socialized and resolved.
For full transparency we're considering the following options.
1. move off of docker.io to quay.io
quay.io also has API rate limit: https://docs.quay.io/issues/429.html
Now I'm not sure about how many requests per seconds one can do vs the other but this would need to be checked with the quay team before changing anything. Also quay.io had its big downtimes as well, SLA needs to be considered.
2. local container builds for each job in master, possibly ussuri
Not convinced. You can look at CI logs: - pulling / updating / pushing container images from docker.io to local registry takes ~10 min on standalone (OVH) - building containers from scratch with updated repos and pushing them to local registry takes ~29 min on standalone (OVH).
3. parent child jobs upstream where rpms and containers will be build
and host artifacts for the child jobs
Yes, we need to investigate that.
4. remove some portion of the upstream jobs to lower the impact we have
on 3rd party infrastructure.
I'm not sure I understand this one, maybe you can give an example of what could be removed?
We need to re-evaulate our use of scenarios (e.g. we have two scenario010's both are non-voting). There's a reason we historically didn't want to add more jobs because of these types of resource constraints. I think we've added new jobs recently and likely need to reduce what we run. Additionally we might want to look into reducing what we run on stable branches as well.
Oh... removing jobs (I thought we would remove some steps of the jobs). Yes, big +1, this should be a continuous goal when working on CI, always evaluating what we need vs what we run now.
We should look at:
1) services deployed in scenarios that aren't worth testing (e.g. deprecated or unused things), and deprecate the unused things
2) jobs themselves (I don't have any example besides scenario010 but I'm sure there are more).
-- Emilien Macchi
On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien@redhat.com> wrote:
On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <aschultz@redhat.com> wrote:
On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi <emilien@redhat.com> wrote:
On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin@redhat.com> wrote:
FYI...
If you find your jobs are failing with an error similar to [1], you
have been rate limited by docker.io via the upstream mirror system and have hit [2]. I've been discussing the issue w/ upstream infra, rdo-infra and a few CI engineers.
There are a few ways to mitigate the issue however I don't see any of the options being completed very quickly so I'm asking for your patience while this issue is socialized and resolved.
For full transparency we're considering the following options.
1. move off of docker.io to quay.io
quay.io also has API rate limit: https://docs.quay.io/issues/429.html
Now I'm not sure about how many requests per seconds one can do vs the other but this would need to be checked with the quay team before changing anything. Also quay.io had its big downtimes as well, SLA needs to be considered.
2. local container builds for each job in master, possibly ussuri
Not convinced. You can look at CI logs: - pulling / updating / pushing container images from docker.io to local registry takes ~10 min on standalone (OVH) - building containers from scratch with updated repos and pushing them to local registry takes ~29 min on standalone (OVH).
3. parent child jobs upstream where rpms and containers will be build
and host artifacts for the child jobs
Yes, we need to investigate that.
4. remove some portion of the upstream jobs to lower the impact we
have on 3rd party infrastructure.
I'm not sure I understand this one, maybe you can give an example of what could be removed?
We need to re-evaulate our use of scenarios (e.g. we have two scenario010's both are non-voting). There's a reason we historically didn't want to add more jobs because of these types of resource constraints. I think we've added new jobs recently and likely need to reduce what we run. Additionally we might want to look into reducing what we run on stable branches as well.
Oh... removing jobs (I thought we would remove some steps of the jobs). Yes big +1, this should be a continuous goal when working on CI, and always evaluating what we need vs what we run now.
We should look at: 1) services deployed in scenarios that aren't worth testing (e.g. deprecated or unused things) (and deprecate the unused things) 2) jobs themselves (I don't have any example beside scenario010 but I'm sure there are more). -- Emilien Macchi
Thanks Alex, Emilien.
+1 to reviewing the catalog and adjusting things on an ongoing basis.
All.. it looks like the issues with docker.io were more of a flare-up than a change in docker.io policy or infrastructure [2]. The flare-up started on July 27 08:00 UTC and ended on July 27 17:00 UTC, see screenshots.
I've socialized the issue with the CI team along with some ways to reduce our reliance on docker.io or any public registry. Sagi and I have a draft design that we'll share on this list after a first round of a POC. We also thought we'd leverage Emilien's awesome work [1] to build containers locally in standalone more widely, to reduce our traffic to docker.io and the upstream proxies.
TLDR, feel free to recheck and wf. Thanks for your patience!!
[1] https://review.opendev.org/#/q/status:open++topic:dos_docker.io
[2] logstash query, be sure to change the time range: <http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22429%20Client%20Error%3A%20Too%20Many%20Requests%20for%20url%3A%5C%22%20AND%20voting%3A1>
On 7/28/20 6:09 PM, Wesley Hayutin wrote:
On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien@redhat.com <mailto:emilien@redhat.com>> wrote:
On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <aschultz@redhat.com <mailto:aschultz@redhat.com>> wrote:
On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi <emilien@redhat.com <mailto:emilien@redhat.com>> wrote: > > > > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin@redhat.com <mailto:whayutin@redhat.com>> wrote: >> >> FYI... >> >> If you find your jobs are failing with an error similar to [1], you have been rate limited by docker.io <http://docker.io> via the upstream mirror system and have hit [2]. I've been discussing the issue w/ upstream infra, rdo-infra and a few CI engineers. >> >> There are a few ways to mitigate the issue however I don't see any of the options being completed very quickly so I'm asking for your patience while this issue is socialized and resolved. >> >> For full transparency we're considering the following options. >> >> 1. move off of docker.io <http://docker.io> to quay.io <http://quay.io> > > > quay.io <http://quay.io> also has API rate limit: > https://docs.quay.io/issues/429.html > > Now I'm not sure about how many requests per seconds one can do vs the other but this would need to be checked with the quay team before changing anything. > Also quay.io <http://quay.io> had its big downtimes as well, SLA needs to be considered. > >> 2. local container builds for each job in master, possibly ussuri > > > Not convinced. > You can look at CI logs: > - pulling / updating / pushing container images from docker.io <http://docker.io> to local registry takes ~10 min on standalone (OVH) > - building containers from scratch with updated repos and pushing them to local registry takes ~29 min on standalone (OVH). > >> >> 3. parent child jobs upstream where rpms and containers will be build and host artifacts for the child jobs > > > Yes, we need to investigate that. > >> >> 4. remove some portion of the upstream jobs to lower the impact we have on 3rd party infrastructure. > > > I'm not sure I understand this one, maybe you can give an example of what could be removed?
We need to re-evaulate our use of scenarios (e.g. we have two scenario010's both are non-voting). There's a reason we historically didn't want to add more jobs because of these types of resource constraints. I think we've added new jobs recently and likely need to reduce what we run. Additionally we might want to look into reducing what we run on stable branches as well.
Oh... removing jobs (I thought we would remove some steps of the jobs). Yes big +1, this should be a continuous goal when working on CI, and always evaluating what we need vs what we run now.
We should look at: 1) services deployed in scenarios that aren't worth testing (e.g. deprecated or unused things) (and deprecate the unused things) 2) jobs themselves (I don't have any example beside scenario010 but I'm sure there are more). -- Emilien Macchi
Thanks Alex, Emilien
+1 to reviewing the catalog and adjusting things on an ongoing basis.
All.. it looks like the issues with docker.io <http://docker.io> were more of a flare up than a change in docker.io <http://docker.io> policy or infrastructure [2]. The flare up started on July 27 8am utc and ended on July 27 17:00 utc, see screenshots.
The number of image prepare workers and their exponential fallback intervals should also be adjusted. I've analysed the log snippet [0] for the connection reset counts by workers versus the times the rate limiting was triggered. See the details in the reported bug [1].
tl;dr -- for an example 5 sec interval, 03:55:31,379 - 03:55:36,110, the conn reset counts by worker PID were:
  3  58412
  2  58413
  3  58415
  3  58417
which seems like too many (workers * reconnects) and triggers rate limiting immediately.
[0] https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcd...
[1] https://bugs.launchpad.net/tripleo/+bug/1889372
-- Best regards, Bogdan Dobrelya, Irc #bogdando
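For anyone who wants to reproduce that kind of analysis on their own job logs, here is a rough sketch of the counting in Python. The log line shape (timestamp, then worker PID, then the error message) is an assumption; adjust the regex to whatever the image prepare log actually emits.

    # Count "connection reset" occurrences per worker PID, bucketed into
    # 5-second windows, from an image-prepare log snippet.
    import re
    from collections import Counter, defaultdict
    from datetime import datetime

    # Assumed line shape: "2020-07-29 03:55:31,379 58412 ... Connection reset by peer"
    LINE_RE = re.compile(
        r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (?P<pid>\d+) .*Connection reset'
    )

    def resets_per_window(path, window=5):
        """Map each window start time to a Counter of {worker pid: reset count}."""
        buckets = defaultdict(Counter)
        with open(path) as fh:
            for line in fh:
                m = LINE_RE.match(line)
                if not m:
                    continue
                ts = datetime.strptime(m.group('ts'), '%Y-%m-%d %H:%M:%S,%f')
                start = ts.replace(microsecond=0, second=ts.second - ts.second % window)
                buckets[start][m.group('pid')] += 1
        return buckets

    if __name__ == '__main__':
        for start, counts in sorted(resets_per_window('container_image_prepare.log').items()):
            print(start, sorted(counts.items()))

A burst like the 11 resets across four workers in the 5-second window above is exactly the (workers * reconnects) pattern that trips the limit.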
On Wed, Jul 29, 2020 at 2:25 AM Bogdan Dobrelya <bdobreli@redhat.com> wrote:
On 7/28/20 6:09 PM, Wesley Hayutin wrote:
On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien@redhat.com <mailto:emilien@redhat.com>> wrote:
On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <aschultz@redhat.com <mailto:aschultz@redhat.com>> wrote:
On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi <emilien@redhat.com <mailto:emilien@redhat.com>> wrote: > > > > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin@redhat.com <mailto:whayutin@redhat.com>> wrote: >> >> FYI... >> >> If you find your jobs are failing with an error similar to [1], you have been rate limited by docker.io <http://docker.io> via the upstream mirror system and have hit [2]. I've been discussing the issue w/ upstream infra, rdo-infra and a few CI engineers. >> >> There are a few ways to mitigate the issue however I don't see any of the options being completed very quickly so I'm asking for your patience while this issue is socialized and resolved. >> >> For full transparency we're considering the following
options.
>> >> 1. move off of docker.io <http://docker.io> to quay.io <http://quay.io> > > > quay.io <http://quay.io> also has API rate limit: > https://docs.quay.io/issues/429.html > > Now I'm not sure about how many requests per seconds one can do vs the other but this would need to be checked with the quay team before changing anything. > Also quay.io <http://quay.io> had its big downtimes as well, SLA needs to be considered. > >> 2. local container builds for each job in master, possibly ussuri > > > Not convinced. > You can look at CI logs: > - pulling / updating / pushing container images from docker.io <http://docker.io> to local registry takes ~10 min on standalone (OVH) > - building containers from scratch with updated repos and pushing them to local registry takes ~29 min on standalone (OVH). > >> >> 3. parent child jobs upstream where rpms and containers will be build and host artifacts for the child jobs > > > Yes, we need to investigate that. > >> >> 4. remove some portion of the upstream jobs to lower the impact we have on 3rd party infrastructure. > > > I'm not sure I understand this one, maybe you can give an example of what could be removed?
We need to re-evaulate our use of scenarios (e.g. we have two scenario010's both are non-voting). There's a reason we historically didn't want to add more jobs because of these types of resource constraints. I think we've added new jobs recently and likely need to reduce what we run. Additionally we might want to look into
reducing
what we run on stable branches as well.
Oh... removing jobs (I thought we would remove some steps of the
jobs).
Yes big +1, this should be a continuous goal when working on CI, and always evaluating what we need vs what we run now.
We should look at: 1) services deployed in scenarios that aren't worth testing (e.g. deprecated or unused things) (and deprecate the unused things) 2) jobs themselves (I don't have any example beside scenario010 but I'm sure there are more). -- Emilien Macchi
Thanks Alex, Emilien
+1 to reviewing the catalog and adjusting things on an ongoing basis.
All.. it looks like the issues with docker.io <http://docker.io> were more of a flare up than a change in docker.io <http://docker.io> policy or infrastructure [2]. The flare up started on July 27 8am utc and ended on July 27 17:00 utc, see screenshots.
The numbers of image prepare workers and its exponential fallback intervals should be also adjusted. I've analysed the log snippet [0] for the connection reset counts by workers versus the times the rate limiting was triggered. See the details in the reported bug [1].
tl;dr -- for an example 5 sec interval 03:55:31,379 - 03:55:36,110:
Conn Reset Counts by a Worker PID: 3 58412 2 58413 3 58415 3 58417
which seems too much of (workers*reconnects) and triggers rate limiting immediately.
[0]
https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcd...
[1] https://bugs.launchpad.net/tripleo/+bug/1889372
-- Best regards, Bogdan Dobrelya, Irc #bogdando
FYI.. The issue w/ "too many requests" is back. Expect delays and failures in attempting to merge your patches upstream across all branches. The issue is being tracked as a critical issue.
On Wed, Jul 29, 2020 at 7:13 AM Wesley Hayutin <whayutin@redhat.com> wrote:
On Wed, Jul 29, 2020 at 2:25 AM Bogdan Dobrelya <bdobreli@redhat.com> wrote:
On 7/28/20 6:09 PM, Wesley Hayutin wrote:
On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien@redhat.com <mailto:emilien@redhat.com>> wrote:
On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <aschultz@redhat.com <mailto:aschultz@redhat.com>> wrote:
On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi <emilien@redhat.com <mailto:emilien@redhat.com>> wrote: > > > > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin@redhat.com <mailto:whayutin@redhat.com>> wrote: >> >> FYI... >> >> If you find your jobs are failing with an error similar to [1], you have been rate limited by docker.io <http://docker.io> via the upstream mirror system and have hit [2]. I've been discussing the issue w/ upstream infra, rdo-infra and a few CI engineers. >> >> There are a few ways to mitigate the issue however I don't see any of the options being completed very quickly so I'm asking for your patience while this issue is socialized and resolved. >> >> For full transparency we're considering the following options. >> >> 1. move off of docker.io <http://docker.io> to quay.io <http://quay.io> > > > quay.io <http://quay.io> also has API rate limit: > https://docs.quay.io/issues/429.html > > Now I'm not sure about how many requests per seconds one can do vs the other but this would need to be checked with the quay team before changing anything. > Also quay.io <http://quay.io> had its big downtimes as well, SLA needs to be considered. > >> 2. local container builds for each job in master, possibly ussuri > > > Not convinced. > You can look at CI logs: > - pulling / updating / pushing container images from docker.io <http://docker.io> to local registry takes ~10 min on standalone (OVH) > - building containers from scratch with updated repos and pushing them to local registry takes ~29 min on standalone (OVH). > >> >> 3. parent child jobs upstream where rpms and containers will be build and host artifacts for the child jobs > > > Yes, we need to investigate that. > >> >> 4. remove some portion of the upstream jobs to lower the impact we have on 3rd party infrastructure. > > > I'm not sure I understand this one, maybe you can give an example of what could be removed?
We need to re-evaulate our use of scenarios (e.g. we have two scenario010's both are non-voting). There's a reason we historically didn't want to add more jobs because of these types of resource constraints. I think we've added new jobs recently and likely need to reduce what we run. Additionally we might want to look into reducing what we run on stable branches as well.
Oh... removing jobs (I thought we would remove some steps of the jobs). Yes big +1, this should be a continuous goal when working on CI, and always evaluating what we need vs what we run now.
We should look at: 1) services deployed in scenarios that aren't worth testing (e.g. deprecated or unused things) (and deprecate the unused things) 2) jobs themselves (I don't have any example beside scenario010 but I'm sure there are more). -- Emilien Macchi
Thanks Alex, Emilien
+1 to reviewing the catalog and adjusting things on an ongoing basis.
All.. it looks like the issues with docker.io <http://docker.io> were more of a flare up than a change in docker.io <http://docker.io> policy or infrastructure [2]. The flare up started on July 27 8am utc and ended on July 27 17:00 utc, see screenshots.
The numbers of image prepare workers and its exponential fallback intervals should be also adjusted. I've analysed the log snippet [0] for the connection reset counts by workers versus the times the rate limiting was triggered. See the details in the reported bug [1].
tl;dr -- for an example 5 sec interval 03:55:31,379 - 03:55:36,110:
Conn Reset Counts by a Worker PID: 3 58412 2 58413 3 58415 3 58417
which seems too much of (workers*reconnects) and triggers rate limiting immediately.
[0] https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcd...
[1] https://bugs.launchpad.net/tripleo/+bug/1889372
-- Best regards, Bogdan Dobrelya, Irc #bogdando
FYI..
The issue w/ "too many requests" is back. Expect delays and failures in attempting to merge your patches upstream across all branches. The issue is being tracked as a critical issue.
Working with the infra folks, we have identified the authorization header as causing issues when we're redirected from docker.io to Cloudflare. I'll throw up a patch tomorrow to handle this case, which should improve our usage of the cache. It needs some testing against other registries to ensure that we don't break authenticated fetching of resources.
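For those wondering what "handle this case" means in practice: when the registry answers a blob request with a redirect to the CDN, the client should stop forwarding the Authorization header once the request leaves the original registry host, so the cache in front can treat the fetch as anonymous and cacheable. A minimal sketch of that behaviour (illustrative only, not the actual patch):

    # Follow a registry blob redirect by hand and drop the Authorization
    # header when the redirect points at a different host (e.g. the CDN),
    # so the response stays cacheable by the proxy in front of us.
    from urllib.parse import urlparse
    import requests

    def fetch_blob(blob_url, token):
        session = requests.Session()
        auth = {'Authorization': 'Bearer %s' % token}
        resp = session.get(blob_url, headers=auth, allow_redirects=False)
        while resp.is_redirect:
            target = resp.headers['Location']
            same_host = urlparse(target).netloc == urlparse(blob_url).netloc
            # Keep the token only while we are still talking to the registry itself.
            resp = session.get(target, headers=auth if same_host else {},
                               allow_redirects=False)
        resp.raise_for_status()
        return resp.content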
On Wed, Jul 29, 2020 at 4:33 PM Alex Schultz <aschultz@redhat.com> wrote:
On Wed, Jul 29, 2020 at 7:13 AM Wesley Hayutin <whayutin@redhat.com> wrote:
On Wed, Jul 29, 2020 at 2:25 AM Bogdan Dobrelya <bdobreli@redhat.com> wrote:
On 7/28/20 6:09 PM, Wesley Hayutin wrote:
On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien@redhat.com <mailto:emilien@redhat.com>> wrote:
On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <aschultz@redhat.com <mailto:aschultz@redhat.com>> wrote:
On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi <emilien@redhat.com <mailto:emilien@redhat.com>> wrote: > > > > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin@redhat.com <mailto:whayutin@redhat.com>> wrote: >> >> FYI... >> >> If you find your jobs are failing with an error similar to [1], you have been rate limited by docker.io <
via the upstream mirror system and have hit [2]. I've been discussing the issue w/ upstream infra, rdo-infra and a few CI engineers. >> >> There are a few ways to mitigate the issue however I don't see any of the options being completed very quickly so I'm asking for your patience while this issue is socialized and resolved. >> >> For full transparency we're considering the following options.
>> >> 1. move off of docker.io <http://docker.io> to quay.io <http://quay.io> > > > quay.io <http://quay.io> also has API rate limit: > https://docs.quay.io/issues/429.html > > Now I'm not sure about how many requests per seconds one
can
do vs the other but this would need to be checked with the
quay
team before changing anything. > Also quay.io <http://quay.io> had its big downtimes as
well,
SLA needs to be considered. > >> 2. local container builds for each job in master, possibly ussuri > > > Not convinced. > You can look at CI logs: > - pulling / updating / pushing container images from docker.io <http://docker.io> to local registry takes ~10 min
on
standalone (OVH) > - building containers from scratch with updated repos and pushing them to local registry takes ~29 min on standalone
(OVH).
> >> >> 3. parent child jobs upstream where rpms and containers
will
be build and host artifacts for the child jobs > > > Yes, we need to investigate that. > >> >> 4. remove some portion of the upstream jobs to lower the impact we have on 3rd party infrastructure. > > > I'm not sure I understand this one, maybe you can give an example of what could be removed?
We need to re-evaulate our use of scenarios (e.g. we have two scenario010's both are non-voting). There's a reason we historically didn't want to add more jobs because of these types of
resource
constraints. I think we've added new jobs recently and likely need to reduce what we run. Additionally we might want to look into
reducing
what we run on stable branches as well.
Oh... removing jobs (I thought we would remove some steps of the
jobs).
Yes big +1, this should be a continuous goal when working on CI,
and
always evaluating what we need vs what we run now.
We should look at: 1) services deployed in scenarios that aren't worth testing (e.g. deprecated or unused things) (and deprecate the unused things) 2) jobs themselves (I don't have any example beside scenario010
but
I'm sure there are more). -- Emilien Macchi
Thanks Alex, Emilien
+1 to reviewing the catalog and adjusting things on an ongoing basis.
All.. it looks like the issues with docker.io <http://docker.io> were more of a flare up than a change in docker.io <http://docker.io> policy or infrastructure [2]. The flare up started on July 27 8am utc and ended on July 27 17:00 utc, see screenshots.
The numbers of image prepare workers and its exponential fallback intervals should be also adjusted. I've analysed the log snippet [0] for the connection reset counts by workers versus the times the rate limiting was triggered. See the details in the reported bug [1].
tl;dr -- for an example 5 sec interval 03:55:31,379 - 03:55:36,110:
Conn Reset Counts by a Worker PID: 3 58412 2 58413 3 58415 3 58417
which seems too much of (workers*reconnects) and triggers rate limiting immediately.
[0]
https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcd...
[1] https://bugs.launchpad.net/tripleo/+bug/1889372
-- Best regards, Bogdan Dobrelya, Irc #bogdando
FYI..
The issue w/ "too many requests" is back. Expect delays and failures in attempting to merge your patches upstream across all branches. The issue is being tracked as a critical issue.
Working with the infra folks and we have identified the authorization header as causing issues when we're rediected from docker.io to cloudflare. I'll throw up a patch tomorrow to handle this case which should improve our usage of the cache. It needs some testing against other registries to ensure that we don't break authenticated fetching of resources.
Thanks Alex!
On Wed, Jul 29, 2020 at 4:48 PM Wesley Hayutin <whayutin@redhat.com> wrote:
On Wed, Jul 29, 2020 at 4:33 PM Alex Schultz <aschultz@redhat.com> wrote:
On Wed, Jul 29, 2020 at 7:13 AM Wesley Hayutin <whayutin@redhat.com> wrote:
On Wed, Jul 29, 2020 at 2:25 AM Bogdan Dobrelya <bdobreli@redhat.com> wrote:
On 7/28/20 6:09 PM, Wesley Hayutin wrote:
On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien@redhat.com <mailto:emilien@redhat.com>> wrote:
On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <
aschultz@redhat.com
<mailto:aschultz@redhat.com>> wrote:
On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi <emilien@redhat.com <mailto:emilien@redhat.com>> wrote: > > > > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin@redhat.com <mailto:whayutin@redhat.com>> wrote: >> >> FYI... >> >> If you find your jobs are failing with an error similar
to
[1], you have been rate limited by docker.io <
via the upstream mirror system and have hit [2]. I've been discussing the issue w/ upstream infra, rdo-infra and a few
CI
engineers. >> >> There are a few ways to mitigate the issue however I
don't
see any of the options being completed very quickly so I'm asking for your patience while this issue is socialized and resolved. >> >> For full transparency we're considering the following options.
>> >> 1. move off of docker.io <http://docker.io> to quay.io <http://quay.io> > > > quay.io <http://quay.io> also has API rate limit: > https://docs.quay.io/issues/429.html > > Now I'm not sure about how many requests per seconds one
can
do vs the other but this would need to be checked with the
quay
team before changing anything. > Also quay.io <http://quay.io> had its big downtimes as
well,
SLA needs to be considered. > >> 2. local container builds for each job in master, possibly
ussuri > > > Not convinced. > You can look at CI logs: > - pulling / updating / pushing container images from docker.io <http://docker.io> to local registry takes ~10
min on
standalone (OVH) > - building containers from scratch with updated repos and pushing them to local registry takes ~29 min on standalone
(OVH).
> >> >> 3. parent child jobs upstream where rpms and containers
will
be build and host artifacts for the child jobs > > > Yes, we need to investigate that. > >> >> 4. remove some portion of the upstream jobs to lower the impact we have on 3rd party infrastructure. > > > I'm not sure I understand this one, maybe you can give an example of what could be removed?
We need to re-evaulate our use of scenarios (e.g. we have two scenario010's both are non-voting). There's a reason we historically didn't want to add more jobs because of these types of
resource
constraints. I think we've added new jobs recently and likely
need to reduce what we run. Additionally we might want to look into
reducing
what we run on stable branches as well.
Oh... removing jobs (I thought we would remove some steps of the
jobs).
Yes big +1, this should be a continuous goal when working on CI,
and
always evaluating what we need vs what we run now.
We should look at: 1) services deployed in scenarios that aren't worth testing (e.g. deprecated or unused things) (and deprecate the unused things) 2) jobs themselves (I don't have any example beside scenario010
but
I'm sure there are more). -- Emilien Macchi
Thanks Alex, Emilien
+1 to reviewing the catalog and adjusting things on an ongoing basis.
All.. it looks like the issues with docker.io <http://docker.io>
were
more of a flare up than a change in docker.io <http://docker.io> policy or infrastructure [2]. The flare up started on July 27 8am utc and ended on July 27 17:00 utc, see screenshots.
The numbers of image prepare workers and its exponential fallback intervals should be also adjusted. I've analysed the log snippet [0] for the connection reset counts by workers versus the times the rate limiting was triggered. See the details in the reported bug [1].
tl;dr -- for an example 5 sec interval 03:55:31,379 - 03:55:36,110:
Conn Reset Counts by a Worker PID: 3 58412 2 58413 3 58415 3 58417
which seems too much of (workers*reconnects) and triggers rate limiting immediately.
[0]
https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcd...
[1] https://bugs.launchpad.net/tripleo/+bug/1889372
-- Best regards, Bogdan Dobrelya, Irc #bogdando
FYI..
The issue w/ "too many requests" is back. Expect delays and failures in attempting to merge your patches upstream across all branches. The issue is being tracked as a critical issue.
Working with the infra folks and we have identified the authorization header as causing issues when we're rediected from docker.io to cloudflare. I'll throw up a patch tomorrow to handle this case which should improve our usage of the cache. It needs some testing against other registries to ensure that we don't break authenticated fetching of resources.
Thanks Alex!
FYI.. the "too many requests" container pull issue has come back to visit us. Alex has some fresh patches on it: https://review.opendev.org/#/q/status:open+project:openstack/tripleo-common+...
Expect trouble in check and gate: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A...
Hey folks,
All of the latest patches to address this have been merged in, but we are still seeing this error randomly in CI jobs that involve an Undercloud or Standalone node. As far as I can tell, the error is appearing less often than before, but it is still present, making merging new patches difficult. I would be happy to help work towards other possible solutions, however I am unsure where to start from here. Any help would be greatly appreciated.
Sincerely,
Luke Short
On Wed, Aug 5, 2020 at 12:26 PM Wesley Hayutin <whayutin@redhat.com> wrote:
On Wed, Jul 29, 2020 at 4:48 PM Wesley Hayutin <whayutin@redhat.com> wrote:
On Wed, Jul 29, 2020 at 4:33 PM Alex Schultz <aschultz@redhat.com> wrote:
On Wed, Jul 29, 2020 at 7:13 AM Wesley Hayutin <whayutin@redhat.com> wrote:
On Wed, Jul 29, 2020 at 2:25 AM Bogdan Dobrelya <bdobreli@redhat.com> wrote:
On 7/28/20 6:09 PM, Wesley Hayutin wrote:
On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien@redhat.com <mailto:emilien@redhat.com>> wrote:
On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <
aschultz@redhat.com
<mailto:aschultz@redhat.com>> wrote:
On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi <emilien@redhat.com <mailto:emilien@redhat.com>> wrote: > > > > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin@redhat.com <mailto:whayutin@redhat.com>> wrote: >> >> FYI... >> >> If you find your jobs are failing with an error similar
to
[1], you have been rate limited by docker.io <
via the upstream mirror system and have hit [2]. I've been discussing the issue w/ upstream infra, rdo-infra and a few
CI
engineers. >> >> There are a few ways to mitigate the issue however I
don't
see any of the options being completed very quickly so I'm asking for your patience while this issue is socialized and resolved. >> >> For full transparency we're considering the following options.
>> >> 1. move off of docker.io <http://docker.io> to quay.io <http://quay.io> > > > quay.io <http://quay.io> also has API rate limit: > https://docs.quay.io/issues/429.html > > Now I'm not sure about how many requests per seconds one
can
do vs the other but this would need to be checked with the
quay
team before changing anything. > Also quay.io <http://quay.io> had its big downtimes as
well,
SLA needs to be considered. > >> 2. local container builds for each job in master, possibly
ussuri > > > Not convinced. > You can look at CI logs: > - pulling / updating / pushing container images from docker.io <http://docker.io> to local registry takes ~10
min on
standalone (OVH) > - building containers from scratch with updated repos and pushing them to local registry takes ~29 min on standalone
(OVH).
> >> >> 3. parent child jobs upstream where rpms and containers
will
be build and host artifacts for the child jobs > > > Yes, we need to investigate that. > >> >> 4. remove some portion of the upstream jobs to lower the impact we have on 3rd party infrastructure. > > > I'm not sure I understand this one, maybe you can give an example of what could be removed?
We need to re-evaulate our use of scenarios (e.g. we have
two
scenario010's both are non-voting). There's a reason we historically didn't want to add more jobs because of these types of
resource
constraints. I think we've added new jobs recently and likely
need to reduce what we run. Additionally we might want to look into
reducing
what we run on stable branches as well.
Oh... removing jobs (I thought we would remove some steps of the jobs).
Yes big +1, this should be a continuous goal when working on
CI, and
always evaluating what we need vs what we run now.
We should look at: 1) services deployed in scenarios that aren't worth testing
(e.g.
deprecated or unused things) (and deprecate the unused things) 2) jobs themselves (I don't have any example beside scenario010
but
I'm sure there are more). -- Emilien Macchi
Thanks Alex, Emilien
+1 to reviewing the catalog and adjusting things on an ongoing
basis.
All.. it looks like the issues with docker.io <http://docker.io>
were
more of a flare up than a change in docker.io <http://docker.io> policy or infrastructure [2]. The flare up started on July 27 8am utc and ended on July 27 17:00 utc, see screenshots.
The numbers of image prepare workers and its exponential fallback intervals should be also adjusted. I've analysed the log snippet [0] for the connection reset counts by workers versus the times the rate limiting was triggered. See the details in the reported bug [1].
tl;dr -- for an example 5 sec interval 03:55:31,379 - 03:55:36,110:
Conn Reset Counts by a Worker PID: 3 58412 2 58413 3 58415 3 58417
which seems too much of (workers*reconnects) and triggers rate limiting immediately.
[0]
https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcd...
[1] https://bugs.launchpad.net/tripleo/+bug/1889372
-- Best regards, Bogdan Dobrelya, Irc #bogdando
FYI..
The issue w/ "too many requests" is back. Expect delays and failures in attempting to merge your patches upstream across all branches. The issue is being tracked as a critical issue.
Working with the infra folks and we have identified the authorization header as causing issues when we're rediected from docker.io to cloudflare. I'll throw up a patch tomorrow to handle this case which should improve our usage of the cache. It needs some testing against other registries to ensure that we don't break authenticated fetching of resources.
Thanks Alex!
FYI.. we have been revisited by the container pull issue, "too many requests". Alex has some fresh patches on it: https://review.opendev.org/#/q/status:open+project:openstack/tripleo-common+...
expect trouble in check and gate:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A...
On Wed, Aug 19, 2020 at 7:15 AM Luke Short <ekultails@gmail.com> wrote:
Hey folks,
All of the latest patches to address this have been merged in but we are still seeing this error randomly in CI jobs that involve an Undercloud or Standalone node. As far as I can tell, the error is appearing less often than before but it is still present making merging new patches difficult. I would be happy to help work towards other possible solutions however I am unsure where to start from here. Any help would be greatly appreciated.
I'm looking at this today, but from what I can tell the problem is likely caused by a reduced anonymous query quota from docker.io and our usage of the upstream mirrors. Because the mirrors essentially funnel all requests through a single IP, we're hitting limits faster than if we didn't use the mirrors. Due to the nature of the requests, the metadata queries don't get cached because of the authorization header, but they are still subject to the rate limiting. Additionally, we're querying the registry to determine which containers we need to update in CI, because we limit our updates to a certain set of containers as part of the CI jobs.
So there are likely a few different steps forward on this, and we can do a few of these together:
1) stop using mirrors (not ideal but likely makes this go away). Alternatively, switch stable branches off the mirrors due to a reduced number of executions and leave mirrors configured on master only (or vice versa).
2) reduce the number of jobs
3) stop querying the registry for the update filters (I'm looking into this today) and use the information in tripleo-common first.
4) build containers always instead of fetching from docker.io
Thanks,
-Alex
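To make it concrete why item 3 matters: every "has this container changed?" check is itself a pair of registry requests per container (a token, then a manifest query), and each one counts against the rate limit even though no layer gets pulled and nothing is cacheable. An illustrative sketch against the public Docker Hub endpoints (the repo name and tag are just placeholders):

    import requests

    REGISTRY = 'https://registry-1.docker.io'
    AUTH = 'https://auth.docker.io/token'

    def manifest_digest(repo, tag='current-tripleo'):
        # Anonymous pull token for this repository (request #1).
        token = requests.get(
            AUTH,
            params={'service': 'registry.docker.io',
                    'scope': 'repository:%s:pull' % repo},
        ).json()['token']
        # Manifest HEAD to read the digest (request #2, repeated per container).
        resp = requests.head(
            '%s/v2/%s/manifests/%s' % (REGISTRY, repo, tag),
            headers={'Authorization': 'Bearer %s' % token,
                     'Accept': 'application/vnd.docker.distribution.manifest.v2+json'},
        )
        resp.raise_for_status()
        return resp.headers.get('Docker-Content-Digest')

    # With ~100 containers in a job that is a couple of hundred requests
    # before a single image layer is pulled.
    print(manifest_digest('tripleomaster/centos-binary-nova-compute'))

Answering those per-container lookups from tripleo-common's own data instead would remove that whole class of requests.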
Sincerely, Luke Short
On Wed, Aug 5, 2020 at 12:26 PM Wesley Hayutin <whayutin@redhat.com> wrote:
On Wed, Jul 29, 2020 at 4:48 PM Wesley Hayutin <whayutin@redhat.com> wrote:
On Wed, Jul 29, 2020 at 4:33 PM Alex Schultz <aschultz@redhat.com> wrote:
On Wed, Jul 29, 2020 at 7:13 AM Wesley Hayutin <whayutin@redhat.com> wrote:
On Wed, Jul 29, 2020 at 2:25 AM Bogdan Dobrelya <bdobreli@redhat.com> wrote:
On 7/28/20 6:09 PM, Wesley Hayutin wrote: > > > On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien@redhat.com > <mailto:emilien@redhat.com>> wrote: > > > > On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <aschultz@redhat.com > <mailto:aschultz@redhat.com>> wrote: > > On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi > <emilien@redhat.com <mailto:emilien@redhat.com>> wrote: > > > > > > > > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin > <whayutin@redhat.com <mailto:whayutin@redhat.com>> wrote: > >> > >> FYI... > >> > >> If you find your jobs are failing with an error similar to > [1], you have been rate limited by docker.io <http://docker.io> > via the upstream mirror system and have hit [2]. I've been > discussing the issue w/ upstream infra, rdo-infra and a few CI > engineers. > >> > >> There are a few ways to mitigate the issue however I don't > see any of the options being completed very quickly so I'm > asking for your patience while this issue is socialized and > resolved. > >> > >> For full transparency we're considering the following options. > >> > >> 1. move off of docker.io <http://docker.io> to quay.io > <http://quay.io> > > > > > > quay.io <http://quay.io> also has API rate limit: > > https://docs.quay.io/issues/429.html > > > > Now I'm not sure about how many requests per seconds one can > do vs the other but this would need to be checked with the quay > team before changing anything. > > Also quay.io <http://quay.io> had its big downtimes as well, > SLA needs to be considered. > > > >> 2. local container builds for each job in master, possibly > ussuri > > > > > > Not convinced. > > You can look at CI logs: > > - pulling / updating / pushing container images from > docker.io <http://docker.io> to local registry takes ~10 min on > standalone (OVH) > > - building containers from scratch with updated repos and > pushing them to local registry takes ~29 min on standalone (OVH). > > > >> > >> 3. parent child jobs upstream where rpms and containers will > be build and host artifacts for the child jobs > > > > > > Yes, we need to investigate that. > > > >> > >> 4. remove some portion of the upstream jobs to lower the > impact we have on 3rd party infrastructure. > > > > > > I'm not sure I understand this one, maybe you can give an > example of what could be removed? > > We need to re-evaulate our use of scenarios (e.g. we have two > scenario010's both are non-voting). There's a reason we > historically > didn't want to add more jobs because of these types of resource > constraints. I think we've added new jobs recently and likely > need to > reduce what we run. Additionally we might want to look into reducing > what we run on stable branches as well. > > > Oh... removing jobs (I thought we would remove some steps of the jobs). > Yes big +1, this should be a continuous goal when working on CI, and > always evaluating what we need vs what we run now. > > We should look at: > 1) services deployed in scenarios that aren't worth testing (e.g. > deprecated or unused things) (and deprecate the unused things) > 2) jobs themselves (I don't have any example beside scenario010 but > I'm sure there are more). > -- > Emilien Macchi > > > Thanks Alex, Emilien > > +1 to reviewing the catalog and adjusting things on an ongoing basis. > > All.. it looks like the issues with docker.io <http://docker.io> were > more of a flare up than a change in docker.io <http://docker.io> policy > or infrastructure [2]. The flare up started on July 27 8am utc and > ended on July 27 17:00 utc, see screenshots.
The numbers of image prepare workers and its exponential fallback intervals should be also adjusted. I've analysed the log snippet [0] for the connection reset counts by workers versus the times the rate limiting was triggered. See the details in the reported bug [1].
tl;dr -- for an example 5 sec interval 03:55:31,379 - 03:55:36,110:
Conn Reset Counts by a Worker PID: 3 58412 2 58413 3 58415 3 58417
which seems too much of (workers*reconnects) and triggers rate limiting immediately.
[0] https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcd...
[1] https://bugs.launchpad.net/tripleo/+bug/1889372
-- Best regards, Bogdan Dobrelya, Irc #bogdando
FYI..
The issue w/ "too many requests" is back. Expect delays and failures in attempting to merge your patches upstream across all branches. The issue is being tracked as a critical issue.
Working with the infra folks and we have identified the authorization header as causing issues when we're rediected from docker.io to cloudflare. I'll throw up a patch tomorrow to handle this case which should improve our usage of the cache. It needs some testing against other registries to ensure that we don't break authenticated fetching of resources.
Thanks Alex!
FYI.. we have been revisited by the container pull issue, "too many requests". Alex has some fresh patches on it: https://review.opendev.org/#/q/status:open+project:openstack/tripleo-common+...
expect trouble in check and gate: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A...
On 8/19/20 3:23 PM, Alex Schultz wrote:
On Wed, Aug 19, 2020 at 7:15 AM Luke Short <ekultails@gmail.com> wrote:
Hey folks,
All of the latest patches to address this have been merged in but we are still seeing this error randomly in CI jobs that involve an Undercloud or Standalone node. As far as I can tell, the error is appearing less often than before but it is still present making merging new patches difficult. I would be happy to help work towards other possible solutions however I am unsure where to start from here. Any help would be greatly appreciated.
I'm looking at this today but from what I can tell the problem is likely caused by a reduced anonymous query quota from docker.io and our usage of the upstream mirrors. Because the mirrors essentially funnel all requests through a single IP we're hitting limits faster than if we didn't use the mirrors. Due to the nature of the requests, the metadata queries don't get cached due to the authorization header but are subject to the rate limiting. Additionally we're querying the registry to determine which containers we need to update in CI because we limit our updates to a certain set of containers as part of the CI jobs.
So there are likely a few different steps forward on this and we can do a few of these together.
1) stop using mirrors (not ideal but likely makes this go away). Alternatively switch stable branches off the mirrors due to a reduced number of executions and leave mirrors configured on master only (or vice versa).
might be good, but it might lead to some other issues - docker might want to rate-limit on container owner. I wouldn't be surprised if they go that way in the future. Could be OK as a first "unlocking step". But we should consider 2) and 3).
2) reduce the number of jobs
always a good thing to do, +1
3) stop querying the registry for the update filters (i'm looking into this today) and use the information in tripleo-common first.
+1 - thanks for looking into it!
4) build containers always instead of fetching from docker.io
meh... last resort, if really nothing else works... It's time consuming and will lead to other issues within the CI (job timeout and the like), wouldn't it?
Thanks, -Alex
Sincerely, Luke Short
On Wed, Aug 5, 2020 at 12:26 PM Wesley Hayutin <whayutin@redhat.com> wrote:
On Wed, Jul 29, 2020 at 4:48 PM Wesley Hayutin <whayutin@redhat.com> wrote:
On Wed, Jul 29, 2020 at 4:33 PM Alex Schultz <aschultz@redhat.com> wrote:
On Wed, Jul 29, 2020 at 7:13 AM Wesley Hayutin <whayutin@redhat.com> wrote:
On Wed, Jul 29, 2020 at 2:25 AM Bogdan Dobrelya <bdobreli@redhat.com> wrote: > > On 7/28/20 6:09 PM, Wesley Hayutin wrote: >> >> >> On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien@redhat.com >> <mailto:emilien@redhat.com>> wrote: >> >> >> >> On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <aschultz@redhat.com >> <mailto:aschultz@redhat.com>> wrote: >> >> On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi >> <emilien@redhat.com <mailto:emilien@redhat.com>> wrote: >> > >> > >> > >> > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin >> <whayutin@redhat.com <mailto:whayutin@redhat.com>> wrote: >> >> >> >> FYI... >> >> >> >> If you find your jobs are failing with an error similar to >> [1], you have been rate limited by docker.io <http://docker.io> >> via the upstream mirror system and have hit [2]. I've been >> discussing the issue w/ upstream infra, rdo-infra and a few CI >> engineers. >> >> >> >> There are a few ways to mitigate the issue however I don't >> see any of the options being completed very quickly so I'm >> asking for your patience while this issue is socialized and >> resolved. >> >> >> >> For full transparency we're considering the following options. >> >> >> >> 1. move off of docker.io <http://docker.io> to quay.io >> <http://quay.io> >> > >> > >> > quay.io <http://quay.io> also has API rate limit: >> > https://docs.quay.io/issues/429.html >> > >> > Now I'm not sure about how many requests per seconds one can >> do vs the other but this would need to be checked with the quay >> team before changing anything. >> > Also quay.io <http://quay.io> had its big downtimes as well, >> SLA needs to be considered. >> > >> >> 2. local container builds for each job in master, possibly >> ussuri >> > >> > >> > Not convinced. >> > You can look at CI logs: >> > - pulling / updating / pushing container images from >> docker.io <http://docker.io> to local registry takes ~10 min on >> standalone (OVH) >> > - building containers from scratch with updated repos and >> pushing them to local registry takes ~29 min on standalone (OVH). >> > >> >> >> >> 3. parent child jobs upstream where rpms and containers will >> be build and host artifacts for the child jobs >> > >> > >> > Yes, we need to investigate that. >> > >> >> >> >> 4. remove some portion of the upstream jobs to lower the >> impact we have on 3rd party infrastructure. >> > >> > >> > I'm not sure I understand this one, maybe you can give an >> example of what could be removed? >> >> We need to re-evaulate our use of scenarios (e.g. we have two >> scenario010's both are non-voting). There's a reason we >> historically >> didn't want to add more jobs because of these types of resource >> constraints. I think we've added new jobs recently and likely >> need to >> reduce what we run. Additionally we might want to look into reducing >> what we run on stable branches as well. >> >> >> Oh... removing jobs (I thought we would remove some steps of the jobs). >> Yes big +1, this should be a continuous goal when working on CI, and >> always evaluating what we need vs what we run now. >> >> We should look at: >> 1) services deployed in scenarios that aren't worth testing (e.g. >> deprecated or unused things) (and deprecate the unused things) >> 2) jobs themselves (I don't have any example beside scenario010 but >> I'm sure there are more). >> -- >> Emilien Macchi >> >> >> Thanks Alex, Emilien >> >> +1 to reviewing the catalog and adjusting things on an ongoing basis. >> >> All.. 
it looks like the issues with docker.io <http://docker.io> were >> more of a flare up than a change in docker.io <http://docker.io> policy >> or infrastructure [2]. The flare up started on July 27 8am utc and >> ended on July 27 17:00 utc, see screenshots. > > The numbers of image prepare workers and its exponential fallback > intervals should be also adjusted. I've analysed the log snippet [0] for > the connection reset counts by workers versus the times the rate > limiting was triggered. See the details in the reported bug [1]. > > tl;dr -- for an example 5 sec interval 03:55:31,379 - 03:55:36,110: > > Conn Reset Counts by a Worker PID: > 3 58412 > 2 58413 > 3 58415 > 3 58417 > > which seems too much of (workers*reconnects) and triggers rate limiting > immediately. > > [0] > https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcd... > > [1] https://bugs.launchpad.net/tripleo/+bug/1889372 > > -- > Best regards, > Bogdan Dobrelya, > Irc #bogdando >
FYI..
The issue w/ "too many requests" is back. Expect delays and failures in attempting to merge your patches upstream across all branches. The issue is being tracked as a critical issue.
Working with the infra folks and we have identified the authorization header as causing issues when we're rediected from docker.io to cloudflare. I'll throw up a patch tomorrow to handle this case which should improve our usage of the cache. It needs some testing against other registries to ensure that we don't break authenticated fetching of resources.
Thanks Alex!
FYI.. we have been revisited by the container pull issue, "too many requests". Alex has some fresh patches on it: https://review.opendev.org/#/q/status:open+project:openstack/tripleo-common+...
expect trouble in check and gate: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A...
-- Cédric Jeanneret (He/Him/His) Sr. Software Engineer - OpenStack Platform Deployment Framework TC Red Hat EMEA https://www.redhat.com/
On 2020-08-19 15:40:08 +0200 (+0200), Cédric Jeanneret wrote:
On 8/19/20 3:23 PM, Alex Schultz wrote: [...]
1) stop using mirrors (not ideal but likely makes this go away). Alternatively switch stable branches off the mirrors due to a reduced number of executions and leave mirrors configured on master only (or vice versa).
might be good, but it might lead to some other issues - docker might want to rate-limit on container owner. I wouldn't be surprised if they go that way in the future. Could be OK as a first "unlocking step". [...]
Be aware that there is another side effect: right now the images are being served from a cache within the same environment as the test nodes, and instead your jobs will begin fetching them over the Internet. This may mean longer average job run time, and a higher percentage of download failures due to network hiccups (whether these will be of a greater frequency than the API rate limit blocking, it's hard to guess). It also necessarily means significantly more bandwidth utilization for our resource donors, particularly as TripleO consumes far more job resources than any other project already. I wonder if there's a middle ground: finding a way to use the cache for fetching images, but connecting straight to Dockerhub when you're querying metadata? It sounds like the metadata requests represent a majority of the actual Dockerhub API calls anyway, and can't be cached regardless. -- Jeremy Stanley
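If that middle ground gets explored, the split could be as simple as keeping two registry endpoints in the image prepare tooling: the regional cache for the (cacheable) image blobs and Docker Hub itself for the (uncacheable) metadata lookups. Purely hypothetical sketch, no such option exists today and the mirror URL is a placeholder:

    # Hypothetical endpoint split: blob/layer pulls keep going through the
    # regional cache, while manifest/tag metadata queries go straight to
    # Docker Hub so they do not all appear to come from the mirror's one IP.
    MIRROR_REGISTRY = 'http://mirror.regionone.example.opendev.org:8082'
    UPSTREAM_REGISTRY = 'https://registry-1.docker.io'

    def registry_endpoint(request_kind):
        """Pick where a request should go based on what it is for."""
        if request_kind == 'metadata':   # tag lists, manifest digests, ...
            return UPSTREAM_REGISTRY
        return MIRROR_REGISTRY           # actual image blobs / layers

Whether the blob redirects would still be served by the proxy in that setup is something that would need checking.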
On Wed, Aug 19, 2020 at 8:59 AM Jeremy Stanley <fungi@yuggoth.org> wrote:
Be aware that there is another side effect: right now the images are being served from a cache within the same environment as the test nodes, and instead your jobs will begin fetching them over the Internet. This may mean longer average job run time, and a higher percentage of download failures due to network hiccups (whether these will be of a greater frequency than the API rate limit blocking, it's hard to guess). It also necessarily means significantly more bandwidth utilization for our resource donors, particularly as TripleO consumes far more job resources than any other project already.
Yeah, I know, so we're trying to find a solution that doesn't make it worse. It would be great if we could have any visibility into the cache hit ratio/requests going through these mirrors to know if we have changes that are improving things or making it worse.
I wonder if there's a middle ground: finding a way to use the cache for fetching images, but connecting straight to Dockerhub when you're querying metadata? It sounds like the metadata requests represent a majority of the actual Dockerhub API calls anyway, and can't be cached regardless.
Maybe, but at the moment I'm working on not even doing the requests at all, which would be better. Next I'll look into that, but the mirror config is handled before we even start requesting things.
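If we do go that route later, the split might look roughly like this (a hypothetical sketch: the mirror URL is a placeholder and real token/error handling is omitted):

    import requests

    DIRECT = 'https://registry-1.docker.io'           # metadata straight to Docker Hub
    MIRROR = 'http://mirror.example.opendev.org:8082'  # placeholder proxy-cache for blobs

    MANIFEST_V2 = 'application/vnd.docker.distribution.manifest.v2+json'

    def get_manifest(session, repo, tag, token):
        # Manifest queries carry an Authorization header and are not cacheable,
        # so send them directly instead of funnelling them through the shared
        # mirror IP that gets rate limited.
        resp = session.get('%s/v2/%s/manifests/%s' % (DIRECT, repo, tag),
                           headers={'Authorization': 'Bearer %s' % token,
                                    'Accept': MANIFEST_V2})
        resp.raise_for_status()
        return resp.json()

    def get_blob(session, repo, digest):
        # Blob fetches are cache-friendly, so those keep going through the mirror.
        resp = session.get('%s/v2/%s/blobs/%s' % (MIRROR, repo, digest), stream=True)
        resp.raise_for_status()
        return resp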
On 2020-08-19 09:14:27 -0600 (-0600), Alex Schultz wrote: [...]
It would be great if we could have any visibility into the cache hit ratio/requests going through these mirrors to know if we have changes that are improving things or making it worse. [...]
Normally we avoid publishing raw Web server logs to protect the privacy of our users, but in this case we might make an exception because the mirrors are only intended for use by our public Zuul jobs and Nodepool image builds. It's worth bringing up with the rest of the team, for sure. -- Jeremy Stanley
On 8/19/20 3:23 PM, Alex Schultz wrote:
On Wed, Aug 19, 2020 at 7:15 AM Luke Short <ekultails@gmail.com> wrote:
Hey folks,
All of the latest patches to address this have been merged in, but we are still seeing this error randomly in CI jobs that involve an Undercloud or Standalone node. As far as I can tell, the error is appearing less often than before, but it is still present, making merging new patches difficult. I would be happy to help work towards other possible solutions; however, I am unsure where to start from here. Any help would be greatly appreciated.
I'm looking at this today, but from what I can tell the problem is likely caused by a reduced anonymous query quota on docker.io combined with our usage of the upstream mirrors. Because the mirrors essentially funnel all requests through a single IP, we're hitting limits faster than if we didn't use the mirrors. The metadata queries don't get cached, because of the authorization header, but they are still subject to the rate limiting. Additionally, we're querying the registry to determine which containers we need to update in CI, because we limit our updates to a certain set of containers as part of the CI jobs.
So there are likely a few different steps forward on this and we can do a few of these together.
1) stop using mirrors (not ideal but likely makes this go away). Alternatively switch stable branches off the mirrors due to a reduced number of executions and leave mirrors configured on master only (or vice versa).
Also, the stable/(N-1) branch could use quay.io, while master keeps using docker.io (assuming containers for that N-1 release will be hosted there instead of the dockerhub)
2) reduce the number of jobs
3) stop querying the registry for the update filters (I'm looking into this today) and use the information in tripleo-common first (see the sketch below)
4) always build containers instead of fetching from docker.io
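For 3), the rough shape would be filtering against data we already ship instead of listing tags from the registry; something like this (the file name and format are made up for illustration, not the real tripleo-common layout):

    import fnmatch

    import yaml

    def containers_to_update(all_images, patterns_file='containers-to-update.yaml'):
        """Filter the full image list against locally shipped glob patterns
        instead of asking docker.io which tags exist, saving one registry
        query per container."""
        with open(patterns_file) as f:
            patterns = yaml.safe_load(f) or []
        return [img for img in all_images
                if any(fnmatch.fnmatch(img, pat) for pat in patterns)]

    # e.g. containers_to_update(['tripleomaster/centos-binary-nova-api',
    #                            'tripleomaster/centos-binary-keystone'])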
Thanks, -Alex
Sincerely, Luke Short
-- Best regards, Bogdan Dobrelya, Irc #bogdando
On Wed, Aug 19, 2020 at 7:53 AM Bogdan Dobrelya <bdobreli@redhat.com> wrote:
On 8/19/20 3:23 PM, Alex Schultz wrote:
1) stop using mirrors (not ideal but likely makes this go away). Alternatively switch stable branches off the mirrors due to a reduced number of executions and leave mirrors configured on master only (or vice versa).
Also, the stable/(N-1) branch could use quay.io, while master keeps using docker.io (assuming containers for that N-1 release will be hosted there instead of the dockerhub)
quay has its own limits and likely will suffer from a similar problem.
On 8/19/20 3:55 PM, Alex Schultz wrote:
quay has its own limits and likely will suffer from a similar problem.
Right. But reducing the total number of requests sent to each registry could end up with less frequent rate limiting by either of the two.
2) reduce the number of jobs 3) stop querying the registry for the update filters (i'm looking into this today) and use the information in tripleo-common first. 4) build containers always instead of fetching from docker.io
There may be a middle-ground solution: building the containers only once for each patchset executed in TripleO Zuul pipelines. Transient images that carry a TTL and self-expire, like [0], should be used for that purpose. [0] https://idbs-engineering.com/containers/2019/08/27/auto-expiry-quayio-tags.h... That would require Zuul jobs with dependencies to pass Ansible variables to each other based on their execution results. Can that be done? It would work pretty much like what we already have in TripleO, where tox jobs are a dependency for standalone/multinode jobs, but with an extra step that prepares such a transient pack of container images (only to be used for that patchset) and pushes it to a quay registry hosted elsewhere by TripleO devops folks. The jobs that have that dependency met can then use those transient images via an Ansible variable passed to the jobs. Auto-expiration solves the space/lifecycle requirements for the cloud that will be hosting that registry.
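As a sketch (the registry namespace and TTL below are purely illustrative), building and pushing such a self-expiring tag could be as simple as attaching the quay.expires-after label described in [0]:

    import subprocess

    REGISTRY = 'quay.io/tripleo-transient'  # placeholder namespace
    TTL = '48h'                             # let quay.io garbage-collect the tag

    def build_transient_image(name, tag, context='.'):
        image = '%s/%s:%s' % (REGISTRY, name, tag)
        # The quay.expires-after label tells quay.io when the pushed tag
        # should expire on its own.
        subprocess.run(['buildah', 'bud',
                        '--label', 'quay.expires-after=%s' % TTL,
                        '-t', image, context], check=True)
        subprocess.run(['buildah', 'push', image,
                        'docker://%s' % image], check=True)
        return image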
-- Best regards, Bogdan Dobrelya, Irc #bogdando
On 8/19/20 4:31 PM, Bogdan Dobrelya wrote:
That would require Zuul jobs with dependencies to pass Ansible variables to each other based on their execution results. Can that be done?
...or even simpler than that, predictable names can be created for those transient images, like <namespace>/<tag>_<patchset>
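For example (the environment variable names below are just stand-ins for however the job actually exposes the change and patchset numbers):

    import os

    def transient_tag(base_tag):
        """Build a predictable per-patchset tag such as 'current-tripleo_12345-3',
        so dependent jobs can derive the image name without passing variables."""
        change = os.environ.get('ZUUL_CHANGE', '0')
        patchset = os.environ.get('ZUUL_PATCHSET', '0')
        return '%s_%s-%s' % (base_tag, change, patchset)

    # e.g. 'quay.io/<namespace>/nova-api:' + transient_tag('current-tripleo')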
-- Best regards, Bogdan Dobrelya, Irc #bogdando
On Tue, Jul 28, 2020, 18:59 Emilien Macchi <emilien@redhat.com> wrote:
On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <aschultz@redhat.com> wrote:
We need to re-evaluate our use of scenarios (e.g. we have two scenario010's, both are non-voting). There's a reason we historically didn't want to add more jobs because of these types of resource constraints. I think we've added new jobs recently and likely need to reduce what we run. Additionally we might want to look into reducing what we run on stable branches as well.
Oh... removing jobs (I thought we would remove some steps of the jobs). Yes big +1, this should be a continuous goal when working on CI, and always evaluating what we need vs what we run now.
We should look at: 1) services deployed in scenarios that aren't worth testing (e.g. deprecated or unused things) (and deprecate the unused things) 2) jobs themselves (I don't have any example beside scenario010 but I'm sure there are more).
Isn't scenario010 testing octavia? Though I've seen it toggling between voting/non-voting due to different issues for a long time.
participants (8)
- Alex Schultz
- Bogdan Dobrelya
- Cédric Jeanneret
- Emilien Macchi
- Jeremy Stanley
- Luke Short
- Rabi Mishra
- Wesley Hayutin