On 2021-02-25 16:17:33 +0000 (+0000), Mark Goddard wrote:
[...]
> If you read the original message, I don't feel that it ascribed
> blame to any party, only a description of what I have found in my
> investigation to date.
You said you had an unspecified bug reproducible in one of our donor providers, and wanted to know how to only run that job in providers where it would succeed. Seemed like blame to me. If your goal is to only have jobs which succeed, then there are lots of options. I'll assume that's not your actual goal though.
> I'm investigating the issue, but was putting feelers out for
> anyone who might have seen something similar. It was lacking in
> detail, for sure. I'm not at a computer at the moment, but I'll
> provide more information when I am.
[...]
Please do. I'd like to help, and to figure out if this is indicative of a broader problem within our infrastructure, but some detail is necessary before I (or anyone) can do that.

For Vexxhost specifically, there was a recent flavor change which significantly increased the amount of RAM available on our nodes booted there. This went into effect when https://review.opendev.org/773710 merged on 2021-02-02, so if that roughly coincides with the appearance of your new bug then memory-related concurrency or configuration decisions might be a good place to start looking, as they could easily be exposed in that provider and not others.
-- 
Jeremy Stanley