[docs] Rate limiting (?) on docs.openstack.org causing link checks to break
Hello everyone,

here's something of a meta-issue that's been bothering me for a while, and I wonder if we can find a solution or a workaround.

Our public documentation at docs.cleura.cloud naturally has cross-references to docs.openstack.org. To make sure that our documentation remains current, the CI/CD build for our docs includes an external link checker.[1] That way, if external documentation paths change or go away, we get notified of a build failure, and this enables us to fix or update our docs.

Unfortunately, this results in random and intermittent errors like this:

Real URL    https://docs.openstack.org/keystone/
Check time  2.759 seconds
Size        251B
Result      Error: ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

The Keystone docs URL is just an example; this sort of error happens all across the docs. Of the scores of docs.openstack.org links we check, it's usually just one that breaks, but it's always a different one.

Kicking the pipeline and rerunning the same link check often "fixes" things; sometimes it doesn't, and sometimes it breaks on a different link.

I surmise that this is caused by a rate limit applied to docs.openstack.org. I understand why such a rate limit would exist, so I am not suggesting removing it, but right now I'm at my wit's end and I'm close to adding docs.openstack.org to the ignore regex in our linkchecker configuration. But I really don't *want* to do that, because I'd much prefer for our docs to contain good cross-references, so I'm interested in keeping docs.o.o included in the link liveness check.

I'm guessing others have run into this issue. If anyone has solved it in a clever way, I'd be grateful for suggestions. Thanks!

Cheers,
Florian

[1] https://linkchecker.github.io/linkchecker/
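As an illustration of one possible workaround for the transient "Connection aborted" failures described above, here is a minimal sketch of re-checking a URL with retries and exponential backoff before treating it as broken. This is not something linkchecker provides; it assumes the requests library, and the helper names, timeouts, and retry counts are made up for this example.

# Hypothetical re-check helper (not part of linkchecker): retry a URL a few
# times with exponential backoff so that a single transient connection reset
# does not cause the whole docs build to fail.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session(retries=4, backoff=2.0):
    # Retry transient connection/read errors and common throttling/server
    # error statuses; backoff_factor produces exponentially growing waits.
    retry = Retry(
        total=retries,
        backoff_factor=backoff,
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=frozenset({"HEAD", "GET"}),
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))
    return session

def link_is_alive(url, session):
    # HEAD first to keep the request cheap, fall back to GET for servers
    # that reject HEAD; any non-error final status counts as alive.
    try:
        resp = session.head(url, allow_redirects=True, timeout=15)
        if resp.status_code in (403, 405):
            resp = session.get(url, allow_redirects=True, timeout=15)
        return resp.ok
    except requests.RequestException:
        return False

if __name__ == "__main__":
    session = build_session()
    print(link_is_alive("https://docs.openstack.org/keystone/", session))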
On Thu, Dec 11, 2025, at 3:30 AM, Florian Haas wrote:
Hello everyone,
here's something of a meta-issue that's been bothering me for a while, and I wonder if we can find a solution or a workaround.
Our public documentation at docs.cleura.cloud naturally has cross-references to docs.openstack.org. To make sure that our documentation remains current, the CI/CD build for our docs includes an external link checker.[1] That way, if external documentation paths change or go away, we get notified of a build failure, and this enables us to fix or update our docs.
Unfortunately, this results in random and intermittent errors like this:
Real URL    https://docs.openstack.org/keystone/
Check time  2.759 seconds
Size        251B
Result      Error: ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
The Keystone docs URL is just an example; this sort of error happens all across the docs. Of the scores of docs.openstack.org links we check, it's usually just one that breaks, but it's always a different one.
Kicking the pipeline and rerunning the same link check often "fixes" things; sometimes it doesn't, and sometimes it breaks on a different link.
I surmise that this is caused by a rate limit applied to docs.openstack.org. I understand why such a rate limit would exist, so I am not suggesting removing it, but right now I'm at my wit's end and I'm close to adding docs.openstack.org to the ignore regex in our linkchecker configuration. But I really don't *want* to do that, because I'd much prefer for our docs to contain good cross-references, so I'm interested in keeping docs.o.o included in the link liveness check.
I'm guessing others have run into this issue. If anyone has solved it in a clever way, I'd be grateful for suggestions. Thanks!
The fundamental problem is the massive increase in web content crawlers all racing to capture more data from the Internet than their LLM training competitors. Most of the crawlers from the big recognizable names are relatively well behaved: they respect things like crawl delay and don't recrawl the same data over and over again in a tight loop without checking whether it has changed first. Then there are the crawlers that spoof user agents, crawl from thousands (or more) of different IP addresses, and will refetch the same data repeatedly.

The impacts of this have been noticed across a number of the services we run, and the documentation server was updated last month to quadruple the total number of connections it will accept: https://opendev.org/opendev/system-config/commit/37856f0bfed44dff4015faa0e6f... If you've seen this problem more recently then we can probably bump the values up further. Eventually we'll hit reasonable limits for this server and need to start load balancing it.

I suspect that this is the Internet's version of induced demand: building more road lanes is only going to put more cars (crawlers) on the road. We may need to consider smarter approaches. I know others have been deploying Anubis, but I worry that it won't be super effective when the bots are already spoofing agents and could in theory run some JS locally. That said, I think we'd be open to exploring the use of that tool or others that have been built to deal with this problem.

TL;DR: yes, this is an ongoing issue. We've made recent changes that we think have helped. If you've seen problems more recently, that is helpful to know, as we can tune things further. But eventually we'll run out of resources on our end and will need to add more or consider alternatives.
Cheers, Florian
[1] https://linkchecker.github.io/linkchecker/
Hi Clark!

On 11/12/2025 18:18, Clark Boylan wrote:
I'm guessing others have run into this issue. If anyone has solved it in a clever way, I'd be grateful for suggestions. Thanks!
The fundamental problem is the massive increase in web content crawlers all racing to capture more data from the Internet than their LLM training competitors. Most of the crawlers from the big recognizable names are relatively well behaved and respect things like crawl delay and don't recrawl the same data over and over again in a tight loop without checking if it has changed first. Then there are the crawlers that spoof user agents, crawl from thousands (or more) of different IP addresses, and will refetch the same data repeatedly.
Right, I figured that was the reason. I completely sympathise.
The impacts of this have been noticed across a number of the services we run and the documentation server was updated last month to quadruple the total number of connections it will accept: https://opendev.org/opendev/system-config/commit/37856f0bfed44dff4015faa0e6f... If you've seen this problem more recently then we can probably bump the values up further. Eventually we'll hit reasonable limits for this server and need to start load balancing it.
You probably hate to hear this, and I hate to be pointing it out, but yes, we've seen the issue more recently. In fact, it goes back some time, and we haven't noticed any recent change in performance or behaviour at all.

However, I also fully appreciate that this may well be a result of us running our CI/CD in GitHub, like pretty much everybody and their sibling (including, presumably, actors of questionable intentions), so our checks, coming from shared runner IP addresses, are probably hard to tell apart from the abusive traffic, and I can't think of a reasonable change to suggest on your end.

This is also a non-trivial issue on the link checker side. Consider this issue, which was already being discussed in 2018:

https://github.com/linkchecker/linkchecker/issues/169

... but that was before the advent of LLM crawlers, so it's likely to be much more complicated today.

I have tried modifying our linkchecker configuration from 10 concurrent connections all the way down to 1; this produced no change in results.

Again, I am becoming fairly convinced that this isn't an issue you can (or even should) fix from the docs.o.o end. Rather, I think it's a better idea not to use GitHub Actions for link checks, or perhaps not to continue with per-commit link checking at all.

Thanks for the reply!

Cheers,
Florian
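For anyone weighing similar options, here is a rough sketch of one possible middle ground: a slow, serialized second pass over only the docs.openstack.org URLs that the first linkchecker run flagged, failing the pipeline only if a link is still unreachable after several spaced-out attempts. This is not a linkchecker feature; the script name, delay values, and invocation below are illustrative assumptions.

# Rough second-pass sketch (illustrative, not a linkchecker feature): given
# the URLs that the first link-check run flagged as broken, re-check only
# the docs.openstack.org ones slowly and one at a time, and fail the
# pipeline only if a link is still unreachable after several spaced attempts.
import sys
import time
import requests

DOCS_HOST = "docs.openstack.org"   # host whose failures we treat as possibly transient
DELAY_SECONDS = 5                  # pause between attempts to stay well clear of rate limits
ATTEMPTS = 3

def still_broken(url):
    for attempt in range(1, ATTEMPTS + 1):
        try:
            if requests.get(url, timeout=30).ok:
                return False
        except requests.RequestException:
            pass
        time.sleep(DELAY_SECONDS * attempt)   # linearly growing pause between attempts
    return True

def main(flagged_urls):
    suspects = [u for u in flagged_urls if DOCS_HOST in u]
    dead = [u for u in suspects if still_broken(u)]
    for url in dead:
        print("still broken after re-check:", url)
    return 1 if dead else 0

if __name__ == "__main__":
    # Hypothetical usage: pass the flagged URLs on the command line, e.g.
    #   python recheck_docs_links.py https://docs.openstack.org/keystone/
    sys.exit(main(sys.argv[1:]))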