cyborg-tempest-plugin test failed because the Zuul server has no accelerators
Hello everyone, Has the Zuul server environment changed recently? There are no accelerators, and our cyborg-tempest-plugin test failed. Please help. Thanks. Best regards.
On 2021-06-21 02:25:51 +0000 (+0000), Alex Song (宋文平) wrote: [...]
Has the Zuul server environment changed recently? There are no accelerators, and our cyborg-tempest-plugin test failed. [...]
Please provide a link to an example build which is failing and was previously succeeding. There's not enough information in your message to begin trying to troubleshoot whatever situation you're running into, and I'd rather not guess. An example will tell us what node type you've configured and whether it's some sort of specialty flavor at one of our providers, for example ubuntu-bionic-gpu, since our standard labels don't guarantee the presence (nor absence) of any specialized accelerator hardware. -- Jeremy Stanley
Hi Jeremy, Thanks for your reply. I have drafted a doc[1] describing the issue we ran into. As far as I remember, we didn't specify any label before, but it seems we need one now. Could you please check the document? Also, is there any guidance to help us add such a label for the cyborg tempest plugin? Thanks in advance. [1] https://docs.google.com/document/d/1dP3s24VugOb5ppvcO-sDkFL6Svzetxk7aiaul3ay... Thanks, Xin-Ran -----Original Message----- From: Jeremy Stanley <fungi@yuggoth.org> Sent: Monday, June 21, 2021 8:49 PM To: openstack-discuss@lists.openstack.org Subject: Re: [infra][cyborg] cyborg-tempest-plugin test failed due to the zuul server has no accelerators On 2021-06-21 02:25:51 +0000 (+0000), Alex Song (宋文平) wrote: [...]
On 2021-06-21 15:21:59 +0000 (+0000), Wang, Xin-ran wrote:
Thanks for your reply. I have drafted a doc[1] describing the issue we ran into. As far as I remember, we didn't specify any label before, but it seems we need one now.
Could you please check the document? Also, is there any guidance to help us add such a label for the cyborg tempest plugin? [...]
I managed to work out how to open the Google Doc, but the information there just shows a Python traceback for a specific Tempest test, which doesn't really help me to understand which Zuul job is failing to build. Can you please point to a failing build of a job so we can find out what nodeset or node label it's trying to use? -- Jeremy Stanley
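Once a failing build is known, its job, branch and nodeset can be read from Zuul's build record instead of the test traceback. The sketch below is a minimal illustration, assuming the OpenDev Zuul tenant API path shown and assuming the build record carries `job_name`, `branch`, `result` and `nodeset` fields; the build UUID you pass in is a placeholder you would copy from the build page.

```python
import json
import urllib.request

# Assumed API base for the "openstack" tenant on OpenDev's Zuul.
ZUUL_API = "https://zuul.opendev.org/api/tenant/openstack"


def build_url(build_uuid: str) -> str:
    """Return the REST URL for a single build record."""
    return f"{ZUUL_API}/build/{build_uuid}"


def fetch_build(build_uuid: str) -> dict:
    """Download one build record as a dict (requires network access)."""
    with urllib.request.urlopen(build_url(build_uuid), timeout=30) as resp:
        return json.load(resp)


def describe_build(build: dict) -> str:
    """Summarize which job and branch a build ran, its result, and its nodeset.

    Missing fields are shown as "?" (or "unknown" for the nodeset), since the
    exact set of fields in the record is an assumption here.
    """
    return "{job} on {branch}: {result} (nodeset: {nodeset})".format(
        job=build.get("job_name", "?"),
        branch=build.get("branch", "?"),
        result=build.get("result", "?"),
        nodeset=build.get("nodeset") or "unknown",
    )
```

Calling `describe_build(fetch_build("<build-uuid>"))` would print a one-line summary that answers the "which nodeset or label" question directly.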
Hi Jeremy Stanley, Here is a patch you can check: https://review.opendev.org/c/openstack/cyborg/+/790937, and here is the failed tempest run: https://050bde8a54f119be7071-8157e9570cd7007a824b373cbf52d06c.ssl.cf2.rackcd... Recently, creating an accelerator server always fails with "No valid host was found", as shown below, and that is out of our control :( """ tempest.exceptions.BuildErrorException: Server feef6015-5211-481b-813f-c5924cdf6931 failed to build and is in ERROR status Details: {'code': 500, 'created': '2021-06-21T01:13:52Z', 'message': 'No valid host was found. '} """ Thanks Jeremy. brinzhang Inspur Electronic Information Industry Co.,Ltd. -----Original Message----- From: Jeremy Stanley [mailto:fungi@yuggoth.org] Sent: June 22, 2021 2:19 To: openstack-discuss@lists.openstack.org Subject: [Sent via lists.openstack.org] Re: [infra][cyborg] cyborg-tempest-plugin test failed due to the zuul server has no accelerators On 2021-06-21 15:21:59 +0000 (+0000), Wang, Xin-ran wrote: [...]
On 2021-06-22 00:23:32 +0000 (+0000), Brin Zhang(张百林) wrote:
Here is a patch you can check: https://review.opendev.org/c/openstack/cyborg/+/790937, and here is the failed tempest run: https://050bde8a54f119be7071-8157e9570cd7007a824b373cbf52d06c.ssl.cf2.rackcd...
Thanks, that helps. The build history indicates that the job was succeeding for openstack/cyborg up through 2021-06-09 08:22 UTC, but was failing consistently as of 2021-06-10 09:26 UTC, so something probably changed in that 24 hour period to affect the job: https://zuul.opendev.org/t/openstack/builds?job_name=cyborg-tempest&project=openstack/cyborg Broadening that query to other projects, I can see it succeeded as recently as 2021-06-10 01:34 for an openstack/nova change in check. What's more interesting is that it's continuing to succeed consistently for stable branches, even stable/wallaby, just not master. Both the succeeding and failing builds for master ran on regular ubuntu-focal nodes in a number of different cloud providers which don't have any specialized accelerator hardware, so I have to assume what's changed has nothing to do with the underlying test environment.
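The bisection by date described above can be sketched against Zuul's build history. This is a minimal sketch, assuming the tenant API path below, its `job_name`/`project`/`limit` query parameters, and build records with `branch`, `result` and a `start_time` like `"2021-06-10T09:26:00"` (UTC, no offset). As a design choice it treats only `SUCCESS` and `FAILURE` as conclusive, so infrastructure results such as node failures do not move the window.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed API base for the "openstack" tenant on OpenDev's Zuul.
ZUUL_API = "https://zuul.opendev.org/api/tenant/openstack"


def fetch_builds(job_name: str, project: str, limit: int = 100) -> List[dict]:
    """Download recent build records for one job and project (needs network)."""
    query = urllib.parse.urlencode(
        {"job_name": job_name, "project": project, "limit": limit}
    )
    with urllib.request.urlopen(f"{ZUUL_API}/builds?{query}", timeout=30) as resp:
        return json.load(resp)


def _finished(builds: Iterable[dict]) -> List[dict]:
    """Keep builds that have a start time and a result, oldest first."""
    done = [b for b in builds if b.get("start_time") and b.get("result")]
    return sorted(done, key=lambda b: datetime.fromisoformat(b["start_time"]))


def latest_result_by_branch(builds: Iterable[dict]) -> Dict[str, str]:
    """Map each branch to the result of its most recent finished build."""
    latest: Dict[str, str] = {}
    for build in _finished(builds):
        latest[build["branch"]] = build["result"]
    return latest


def regression_window(
    builds: Iterable[dict], branch: str = "master"
) -> Optional[Tuple[str, str]]:
    """Return (last success, first failure after it) start times on a branch.

    Returns None when the branch has no success followed by a failure,
    for example when its newest conclusive build passed.
    """
    last_success: Optional[dict] = None
    first_failure: Optional[dict] = None
    for build in _finished(builds):
        if build.get("branch") != branch:
            continue
        if build["result"] == "SUCCESS":
            # A newer success resets the window.
            last_success, first_failure = build, None
        elif build["result"] == "FAILURE":
            if last_success is not None and first_failure is None:
                first_failure = build
    if last_success is None or first_failure is None:
        return None
    return last_success["start_time"], first_failure["start_time"]
```

With `builds = fetch_builds("cyborg-tempest", "openstack/cyborg")`, `regression_window(builds)` would give the master window and `latest_result_by_branch(builds)` would show which branches are still passing. If the regression is older than the fetched page of results, a larger `limit` would be needed.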
Recently, creating an accelerator server always fails with "No valid host was found", as shown below, and that is out of our control :( """ tempest.exceptions.BuildErrorException: Server feef6015-5211-481b-813f-c5924cdf6931 failed to build and is in ERROR status Details: {'code': 500, 'created': '2021-06-21T01:13:52Z', 'message': 'No valid host was found. '} """ [...]
This is when scheduling an accelerator within DevStack, right? Were you maybe using some sort of mock/fake accelerator for testing purposes? There wouldn't have been actual accelerators exposed to that environment even back when the job was still succeeding. Regardless, I suspect something merged early on 2021-06-10 (UTC) to the master branch of one of the services or tools with which Cyborg interacts, causing this error to begin appearing. The fact that the same job is running fine for stable/wallaby also indicates it's probably some behavior which hasn't been backported yet. Hopefully that helps narrow it down. -- Jeremy Stanley
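One way to narrow it down is to list changes merged to master in the window between the last passing and first failing build. This is a sketch, assuming OpenDev's Gerrit supports the `mergedafter:`/`mergedbefore:` search operators with a `"YYYY-MM-DD HH:MM:SS"` time format, and relying on Gerrit's REST convention of prefixing JSON responses with `)]}'`. The candidate projects are not named in the thread and are purely hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import List

# OpenDev's Gerrit, as linked earlier in this thread.
GERRIT = "https://review.opendev.org"


def merged_window_query(project: str, after: str, before: str,
                        branch: str = "master") -> str:
    """Build a Gerrit search for changes merged between two times.

    Assumes the mergedafter:/mergedbefore: operators are available on the
    Gerrit server being queried.
    """
    return (f"project:{project} branch:{branch} status:merged "
            f'mergedafter:"{after}" mergedbefore:"{before}"')


def parse_gerrit_json(body: str):
    """Strip Gerrit's )]}' anti-XSSI prefix, then decode the JSON payload."""
    prefix = ")]}'"
    if body.startswith(prefix):
        body = body[len(prefix):]
    return json.loads(body)


def fetch_merged(project: str, after: str, before: str,
                 branch: str = "master") -> List[dict]:
    """Query Gerrit's /changes/ endpoint (requires network access)."""
    q = merged_window_query(project, after, before, branch)
    url = f"{GERRIT}/changes/?" + urllib.parse.urlencode({"q": q})
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_gerrit_json(resp.read().decode("utf-8"))
```

For example, `fetch_merged("openstack/nova", "2021-06-09 08:22:00", "2021-06-10 09:26:00")` would return change records (with fields such as `_number` and `subject`) for nova changes merged in that window. The same call could be repeated for each service the cyborg-tempest job deploys.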
participants (4)
- Alex Song (宋文平)
- Brin Zhang (张百林)
- Jeremy Stanley
- Wang, Xin-ran