Hi,

Some of the Kayobe CI jobs are failing when run on the vexxhost cloud. The particular part that fails is when testing 'bare metal' deployment using libvirt/qemu VMs. As far as I can tell, the test passes fairly reliably on other clouds, and fails reliably on vexxhost.

Does anyone have any suggestion as to what might be causing this? priteau suggested entropy could be an issue, but we have not investigated yet.

Is there any way to block a particular cloud provider for certain CI jobs until this is resolved?

Thanks,
Mark
On Thu, 25 Feb 2021 at 09:19, Mark Goddard <mark@stackhpc.com> wrote:
Hi,
Some of the Kayobe CI jobs are failing when run on the vexxhost cloud. The particular part that fails is when testing 'bare metal' deployment using libvirt/qemu VMs. As far as I can tell, the test passes fairly reliably on other clouds, and fails reliably on vexxhost.
Does anyone have any suggestion as to what might be causing this? priteau suggested entropy could be an issue, but we have not investigated yet.
priteau proposed a patch [1] to add entropy to CI logs. I found one job [2] that ran on vexxhost and it had over 3k, which I understand to be sufficient.

[1] https://review.opendev.org/c/openstack/kayobe/+/777365
[2] https://zuul.opendev.org/t/openstack/build/872549209ff240b7aa52b1430896cd86/
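For anyone wanting to check the same figure on a node by hand, here is a minimal sketch of reading the kernel's entropy estimate on Linux. It is not taken from the patch in [1]; the function names and the `min_bits` threshold are assumptions made purely for illustration.

```python
from pathlib import Path
from typing import Optional

# Standard Linux procfs location for the kernel's entropy estimate (in bits).
ENTROPY_PATH = Path("/proc/sys/kernel/random/entropy_avail")


def read_entropy(path: Path = ENTROPY_PATH) -> Optional[int]:
    """Return the available entropy in bits, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def entropy_log_line(bits: Optional[int], min_bits: int = 1000) -> str:
    """Format one log line.

    min_bits is a hypothetical threshold chosen for this example only,
    not a recommended value.
    """
    if bits is None:
        return "entropy_avail: unavailable"
    status = "ok" if bits >= min_bits else "low"
    return f"entropy_avail: {bits} bits ({status})"


if __name__ == "__main__":
    print(entropy_log_line(read_entropy()))
```

Running this early in a job and again just before the VM deployment step would show whether the pool drains during the run, which a single reading cannot tell you.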
Is there any way to block a particular cloud provider for certain CI jobs until this is resolved?
Thanks, Mark
On 2021-02-25 09:19:38 +0000 (+0000), Mark Goddard wrote: [...]
Is there any way to block a particular cloud provider for certain CI jobs until this is resolved?
So just to get this straight, it sounds like you've got a bug in your software or your tests which you can only reproduce on one cloud provider, and you're asking if we can find a way for you to be able to ignore this bug for now so you can go on releasing with it? If some of your users come to you saying your software is failing for them, do you suggest that they turn off their systems/stop running the software? How about start with a link to a failure representative of the problem, so others can help you try to identify the cause. The build you linked in your followup message succeeded, so not much help in figuring out why it's sometimes failing.

Also, no, we don't really have a means to allow projects to decide which providers their jobs will run in. If most everyone's jobs are failing in some provider we disable that provider and work with them to figure out the problem. If only one project's jobs are failing, it tends to suggest the problem is not with the provider, or maybe with a feature in that provider which no other project is trying to use. Either way, let's get some detail about the problem and go from there. Maybe the issue is more widespread and we can try to find evidence to bring to the provider's support team, but so far you've provided no evidence to back that assertion.

-- Jeremy Stanley
On Thu, 25 Feb 2021 at 14:27, Jeremy Stanley <fungi@yuggoth.org> wrote:
On 2021-02-25 09:19:38 +0000 (+0000), Mark Goddard wrote: [...]
Is there any way to block a particular cloud provider for certain CI jobs until this is resolved?
So just to get this straight, it sounds like you've got a bug in your software or your tests which you can only reproduce on one cloud provider, and you're asking if we can find a way for you to be able to ignore this bug for now so you can go on releasing with it? If some of your users come to you saying your software is failing for them, do you suggest that they turn off their systems/stop running the software? How about start with a link to a failure representative of the problem, so others can help you try to identify the cause. The build you linked in your followup message succeeded, so not much help in figuring out why it's sometimes failing.
Also, no, we don't really have a means to allow projects to decide which providers their jobs will run in. If most everyone's jobs are failing in some provider we disable that provider and work with them to figure out the problem. If only one project's jobs are failing, it tends to suggest the problem is not with the provider, or maybe with a feature in that provider which no other project is trying to use. Either way, let's get some detail about the problem and go from there. Maybe the issue is more widespread and we can try to find evidence to bring to the provider's support team, but so far you've provided no evidence to back that assertion. -- Jeremy Stanley
I don't really appreciate the public dressing down, thanks.
On 2021-02-25 14:37:23 +0000 (+0000), Mark Goddard wrote: [...]
I don't really appreciate the public dressing down, thanks.
And I don't especially appreciate the infrastructure I help run being blamed for what is just as likely a bug in software you're maintaining, without so much as a hint as to what you're seeing. It happens constantly, from many corners of the community, and is demoralizing to the point where our systems administrators quit because they feel unappreciated. It leads directly to not being able to provide you with any testing at all.

So asking again, can you *please* at least link to an example failure?

-- Jeremy Stanley
On Thu, 25 Feb 2021, 15:25 Jeremy Stanley, <fungi@yuggoth.org> wrote:
On 2021-02-25 14:37:23 +0000 (+0000), Mark Goddard wrote: [...]
I don't really appreciate the public dressing down, thanks.
And I don't especially appreciate the infrastructure I help run being blamed for what is just as likely a bug in software you're maintaining, without so much as a hint as to what you're seeing. It happens constantly, from many corners of the community, and is demoralizing to the point where our systems administrators quit because they feel unappreciated. It leads directly to not being able to provide you with any testing at all.
If you read the original message, I don't feel that it ascribed blame to any party, only a description of what I have found in my investigation to date. I'm investigating the issue, but was putting feelers out for anyone who might have seen something similar. It was lacking in detail, for sure. I'm not at a computer at the moment, but I'll provide more information when I am.
So asking again, can you *please* at least link to an example failure? -- Jeremy Stanley
On 2021-02-25 16:17:33 +0000 (+0000), Mark Goddard wrote: [...]
If you read the original message, I don't feel that it ascribed blame to any party, only a description of what I have found in my investigation to date.
You said you had an unspecified bug reproducible in one of our donor providers, and wanted to know how to only run that job in providers where it would succeed. Seemed like blame to me. If your goal is to only have jobs which succeed, then there are lots of options. I'll assume that's not your actual goal though.
I'm investigating the issue, but was putting feelers out for anyone who might have seen something similar. It was lacking in detail, for sure. I'm not at a computer at the moment, but I'll provide more information when I am. [...]
Please do. I'd like to help, and to figure out if this is indicative of a broader problem within our infrastructure, but some detail is necessary before I (or anyone) can do that.

For Vexxhost specifically, there was a recent flavor change which significantly increased the amount of RAM available on our nodes booted there. This went into effect when https://review.opendev.org/773710 merged on 2021-02-02, so if that roughly coincides with the appearance of your new bug then memory-related concurrency or configuration decisions might be a good place to start looking, as they could easily be exposed in that provider and not others.

-- Jeremy Stanley
On Thu, Feb 25, 2021 at 4:24 AM Mark Goddard <mark@stackhpc.com> wrote:
Hi,
Some of the Kayobe CI jobs are failing when run on the vexxhost cloud. The particular part that fails is when testing 'bare metal' deployment using libvirt/qemu VMs. As far as I can tell, the test passes fairly reliably on other clouds, and fails reliably on vexxhost.
Can you please provide logs or any exact error outcomes of what's happening so I can try and help? If we want to dig in, we'll need a little bit more information. I suspect it may be nested virtualization at play here which is always a tricky thing. Are you using that or are you using plain qemu?
Does anyone have any suggestion as to what might be causing this? priteau suggested entropy could be an issue, but we have not investigated yet.
Is there any way to block a particular cloud provider for certain CI jobs until this is resolved?
Thanks, Mark
-- Mohammed Naser VEXXHOST, Inc.
On Thu, 25 Feb 2021 at 15:22, Mohammed Naser <mnaser@vexxhost.com> wrote:
On Thu, Feb 25, 2021 at 4:24 AM Mark Goddard <mark@stackhpc.com> wrote:
Hi,
Some of the Kayobe CI jobs are failing when run on the vexxhost cloud. The particular part that fails is when testing 'bare metal' deployment using libvirt/qemu VMs. As far as I can tell, the test passes fairly reliably on other clouds, and fails reliably on vexxhost.
Can you please provide logs or any exact error outcomes of what's happening so I can try and help?
If we want to dig in, we'll need a little bit more information. I suspect it may be nested virtualization at play here which is always a tricky thing. Are you using that or are you using plain qemu?
Hi Mohammed, thanks for the offer of help. You are quite right about nested virt - I tracked the issue down to a change to our test configuration that inadvertently stopped us forcing the use of qemu. With that change reverted, the job does now pass on vexxhost infrastructure.
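To illustrate the kind of setting involved, here is a rough sketch of choosing a libvirt domain type for test VMs. It is not Kayobe's actual configuration; `choose_virt_type` and its `force_qemu` flag are hypothetical names. The only facts relied on are that libvirt accepts `qemu` (plain emulation) and `kvm` (hardware acceleration) as domain types, and that `/dev/kvm` is the usual KVM device node on Linux.

```python
import os


def choose_virt_type(force_qemu: bool, kvm_device: str = "/dev/kvm") -> str:
    """Pick a libvirt domain type for test VMs.

    force_qemu is a hypothetical switch standing in for whatever setting
    a test configuration uses to avoid nested virtualization.
    """
    if force_qemu:
        # Plain emulation: slower, but independent of nested virt support.
        return "qemu"
    # Without the override, fall back to whatever the host exposes. On a
    # cloud node with nested virt enabled this selects "kvm".
    return "kvm" if os.path.exists(kvm_device) else "qemu"


if __name__ == "__main__":
    print(choose_virt_type(force_qemu=True))
```

With a switch like this left off, the outcome depends on whether the provider exposes nested virtualization, which is one way a job could behave differently from one cloud to another.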
Does anyone have any suggestion as to what might be causing this? priteau suggested entropy could be an issue, but we have not investigated yet.
Is there any way to block a particular cloud provider for certain CI jobs until this is resolved?
Thanks, Mark
-- Mohammed Naser VEXXHOST, Inc.