On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien@redhat.com> wrote:
On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <aschultz@redhat.com> wrote:
On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi <emilien@redhat.com> wrote:
On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin@redhat.com> wrote:
FYI...
If you find your jobs are failing with an error similar to [1], you have been rate limited by docker.io via the upstream mirror system and have hit [2]. I've been discussing the issue w/ upstream infra, rdo-infra and a few CI engineers.
There are a few ways to mitigate the issue; however, I don't see any of the options being completed very quickly, so I'm asking for your patience while this issue is socialized and resolved.
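A quick way to check whether a given job log shows this failure is to scan it for the rate-limit message. The sketch below assumes the error looks like the message string used in the logstash query quoted later in this thread ("429 Client Error: Too Many Requests for url:"); the function name and the sample log lines are hypothetical and only there for illustration.

```python
import re

# Marker string taken from the logstash query quoted in this thread.
# Assumption: the URL follows the marker and contains no whitespace.
RATE_LIMIT_RE = re.compile(r"429 Client Error: Too Many Requests for url: (\S+)")


def find_rate_limit_hits(lines):
    """Return (line_number, url) for every log line containing the marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = RATE_LIMIT_RE.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


# Hypothetical log excerpt, not taken from a real job.
sample_log = [
    "pulling image example/image:tag",
    "HTTPError: 429 Client Error: Too Many Requests for url: "
    "https://registry.example/v2/example/image/manifests/tag",
    "job finished",
]

for number, url in find_rate_limit_hits(sample_log):
    print(f"line {number}: rate limited on {url}")
```

An empty result only means the marker string was not found; it does not rule out other registry failures.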
For full transparency we're considering the following options.
1. move off of docker.io to quay.io
quay.io also has API rate limit: https://docs.quay.io/issues/429.html
Now I'm not sure how many requests per second one can make against one vs. the other, but this would need to be checked with the quay team before changing anything. Also, quay.io has had its own big downtimes, so the SLA needs to be considered.
2. local container builds for each job in master, possibly ussuri
Not convinced. You can look at CI logs:
- pulling / updating / pushing container images from docker.io to local registry takes ~10 min on standalone (OVH)
- building containers from scratch with updated repos and pushing them to local registry takes ~29 min on standalone (OVH).
3. parent/child jobs upstream where rpms and containers will be built
and hosted as artifacts for the child jobs
Yes, we need to investigate that.
4. remove some portion of the upstream jobs to lower the impact we
have on 3rd party infrastructure.
I'm not sure I understand this one, maybe you can give an example of what could be removed?
We need to re-evaluate our use of scenarios (e.g. we have two scenario010's, both non-voting). We historically didn't want to add more jobs precisely because of these types of resource constraints. I think we've added new jobs recently and likely need to reduce what we run. Additionally, we might want to look into reducing what we run on stable branches as well.
Oh... removing jobs (I thought we would remove some steps of the jobs). Yes big +1, this should be a continuous goal when working on CI, and always evaluating what we need vs what we run now.
We should look at:
1) services deployed in scenarios that aren't worth testing (e.g. deprecated or unused things) (and deprecate the unused things)
2) jobs themselves (I don't have any example besides scenario010 but I'm sure there are more).
--
Emilien Macchi
Thanks Alex, Emilien. +1 to reviewing the catalog and adjusting things on an ongoing basis.

All.. it looks like the issues with docker.io were more of a flare-up than a change in docker.io policy or infrastructure [2]. The flare-up started on July 27 at 08:00 UTC and ended on July 27 at 17:00 UTC, see screenshots.

I've socialized the issue with the CI team, along with some ways to reduce our reliance on docker.io or any public registry. Sagi and I have a draft design that we'll share on this list after a first round of a POC. We also thought we'd leverage Emilien's awesome work [1] to build containers locally in standalone more widely to reduce our traffic to docker.io and upstream proxies.

TLDR, feel free to recheck and wf. Thanks for your patience!!

[1] https://review.opendev.org/#/q/status:open++topic:dos_docker.io
[2] link to logstash query, be sure to change the time range
<http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22429%20Client%20Error%3A%20Too%20Many%20Requests%20for%20url%3A%5C%22%20AND%20voting%3A1>
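
For anyone who wants to adapt the logstash query in [2], it may help to see it decoded. The sketch below only percent-decodes the `query=` value copied from the link above using Python's standard `urllib.parse`; the helper names are hypothetical, and it does not contact logstash.

```python
from urllib.parse import quote, unquote

# The query= value copied verbatim from link [2].
ENCODED_QUERY = (
    "message%3A%5C%22429%20Client%20Error%3A%20Too%20Many%20Requests"
    "%20for%20url%3A%5C%22%20AND%20voting%3A1"
)


def decode_query(encoded):
    """Percent-decode a query value into readable text."""
    return unquote(encoded)


def encode_query(query):
    """Percent-encode a query so it can be pasted back into the URL."""
    # safe="" also encodes ':' and '/', matching the style used in link [2].
    return quote(query, safe="")


decoded = decode_query(ENCODED_QUERY)
print(decoded)
```

The decoded text shows that the query matches the 429 "Too Many Requests" message and is restricted to voting jobs (`voting:1`). To look for other messages, edit the decoded string and pass it through `encode_query` again.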