Hi Clark!

On 11/12/2025 18:18, Clark Boylan wrote:
>> I'm guessing others have run into this issue. If anyone has solved it in a clever way, I'd be grateful for suggestions. Thanks!
> The fundamental problem is the massive increase in web content crawlers all racing to capture more data from the Internet than their LLM training competitors. Most of the crawlers from the big recognizable names are relatively well behaved: they respect things like crawl delay and don't recrawl the same data over and over in a tight loop without checking whether it has changed first. Then there are the crawlers that spoof user agents, crawl from thousands (or more) of different IP addresses, and refetch the same data repeatedly.
Right, I figured that was the reason. I completely sympathise.
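(For anyone skimming the archive later: the "crawl delay" Clark mentions is the robots.txt directive that some well-behaved crawlers honour. A minimal example, with made-up values, looks like the below; the misbehaving crawlers simply ignore it, which is rather the point.)

    # robots.txt -- illustrative values only
    User-agent: *
    Crawl-delay: 10    # ask crawlers to wait 10 seconds between requests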
> The impacts of this have been noticed across a number of the services we run, and the documentation server was updated last month to quadruple the total number of connections it will accept: https://opendev.org/opendev/system-config/commit/37856f0bfed44dff4015faa0e6f... If you've seen this problem more recently then we can probably bump the values up further. Eventually we'll hit reasonable limits for this server and need to start load balancing it.
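(Aside for other list readers: I haven't studied the linked commit, but "total number of connections it will accept" usually maps to the web server's worker limits. Assuming Apache with the event MPM, which is purely my assumption about the docs server, the knobs involved look roughly like this, with illustrative values not taken from the actual change:)

    # Illustrative Apache mpm_event tuning -- NOT from the actual commit
    <IfModule mpm_event_module>
        ServerLimit          8
        ThreadsPerChild    100
        MaxRequestWorkers  800    # e.g. quadrupled from a previous 200
    </IfModule>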
You probably hate to hear this, and I hate to be the one pointing it out, but yes, we've seen the issue more recently. In fact, it goes back quite some time, and we haven't noticed any recent change in performance/behaviour at all. However, I also fully appreciate that this may well be a result of us running our CI/CD on GitHub, like pretty much everybody and their sibling (including, presumably, actors of questionable intentions), so I can't think of anything reasonable to suggest you change.

This is also a non-trivial issue on the link-checker side. Consider this issue, which was already discussed in 2018: https://github.com/linkchecker/linkchecker/issues/169 ... but that was before the advent of LLM crawlers, so the picture is likely much more complicated today. I have tried modifying our linkchecker configuration from 10 concurrent connections all the way down to 1 (first snippet in the P.S. below); this produced no change in results.

Again, I am becoming fairly convinced that this isn't something you can (or even should) fix from the docs.o.o end. Rather, I think the better approach is for us not to use GitHub Actions for link checks, or perhaps not to do per-commit link checking at all (second snippet in the P.S. below).

Thanks for the reply!

Cheers,
Florian
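P.S. Two sketches, in case they help anyone reproduce or follow up. First, the concurrency knob I was adjusting lives in our linkcheckerrc; roughly this, with the value shown being the lowest setting I tried:

    # linkcheckerrc -- the [checking] section controls concurrency
    [checking]
    # number of parallel checking threads; I tried everything from 10 down to 1
    threads=1

Second, if we do stay on GitHub Actions, replacing the per-commit trigger with a scheduled one would at least stop us hammering docs.o.o on every push. A hypothetical trigger block (the file name and schedule are mine, not our actual config):

    # .github/workflows/linkcheck.yml (hypothetical)
    on:
      schedule:
        - cron: "17 3 * * 1"   # weekly, Mondays at 03:17 UTC
      workflow_dispatch: {}    # still allow manual runs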