[ironic] [thirdparty-ci] BaremetalBasicOps test
Hello all,

Our ironic job has been broken, and it seems to be due to a lack of IPs. We allocate two IPs to our job: one for the DHCP server and one for the target node. This had been working for as long as the job has existed, but since about early December 2018 it has been failing.

The job is able to clean the node during devstack and successfully deploy to the node during the tempest run, and the node is successfully validated via SSH. The node then moves to clean failed with a network error [1], and the job subsequently fails. Sometime between the validation and the attempt to clean, the neutron port associated with the ironic port is deleted and a new port comes into existence. Where I'm having trouble is finding out what this port is. Based on its MAC address, it's a virtual port, and its MAC is not the same as the ironic port's.

We could add an IP to the job to fix it, but I'd rather not do that needlessly.

Any insight or advice would be appreciated here!

Thanks,
Mike Turek <mjturek>

[1] http://paste.openstack.org/show/743191/
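The message does not say how the MAC address was judged to be virtual. One common heuristic, shown here only as a sketch, is to check the locally administered bit of the first octet and to compare the address against neutron's default `base_mac` prefix (`fa:16:3e`). The function names and the sample addresses below are hypothetical, and a deployment may override `base_mac`, so treat a match as a hint rather than proof.

```python
# Heuristic sketch: does a MAC address look like a software-assigned one?
# Function names and sample addresses are illustrative assumptions.

NEUTRON_DEFAULT_PREFIX = "fa:16:3e"  # neutron's default base_mac prefix


def is_locally_administered(mac: str) -> bool:
    """Return True if the locally administered (U/L) bit is set.

    The bit is the second least significant bit of the first octet.
    Vendor-assigned hardware addresses normally have it cleared.
    """
    first_octet = int(mac.replace("-", ":").split(":")[0], 16)
    return bool(first_octet & 0b00000010)


def has_neutron_default_prefix(mac: str) -> bool:
    """Return True if the MAC starts with neutron's default prefix."""
    normalized = mac.replace("-", ":").lower()
    return normalized.startswith(NEUTRON_DEFAULT_PREFIX)


if __name__ == "__main__":
    for mac in ("fa:16:3e:12:34:56", "00:1a:2b:3c:4d:5e"):
        print(mac, is_locally_administered(mac), has_neutron_default_prefix(mac))
```

A port whose MAC passes both checks was most likely generated by neutron rather than copied from a physical NIC, which would be consistent with it differing from the ironic port's MAC.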
On Thu, Jan 31, 2019 at 8:37 AM Michael Turek <mjturek@linux.vnet.ibm.com> wrote:
[trim]
The job is able to clean the node during devstack, successfully deploy to the node during the tempest run, and is successfully validated via ssh. The node then moves to clean failed with a network error [1], and the job subsequently fails. Sometime between the validation and attempting to clean, the neutron port associated with the ironic port is deleted and a new port comes into existence. Where I'm having trouble is finding out what this port is. Based on its MAC address, it's a virtual port, and its MAC is not the same as the ironic port.
I think we landed code around then to address the issue of duplicate MAC addresses where a port gets orphaned by external processes. If I remember correctly, the default logic now just resets the MAC if we no longer need the port. What network settings are you operating the job with? It seems like 'flat' is at least the network_interface, based on what you're describing.
We could add an IP to the job to fix it, but I'd rather not do that needlessly.
Hey Julia,

On 2/4/19 8:51 AM, Julia Kreger wrote:
On Thu, Jan 31, 2019 at 8:37 AM Michael Turek <mjturek@linux.vnet.ibm.com> wrote:
[trim]
The job is able to clean the node during devstack, successfully deploy to the node during the tempest run, and is successfully validated via ssh. The node then moves to clean failed with a network error [1], and the job subsequently fails. Sometime between the validation and attempting to clean, the neutron port associated with the ironic port is deleted and a new port comes into existence. Where I'm having trouble is finding out what this port is. Based on its MAC address, it's a virtual port, and its MAC is not the same as the ironic port.

I think we landed code around then to address the issue of duplicate MAC addresses where a port gets orphaned by external processes, so by default I seem to remember the logic now just resets the MAC if we no longer need the port.

Interesting! I'll look for the patch. If you have it handy, please share.

What are the network settings you're operating the job with? It seems like 'flat' is at least the network_interface based on what you're describing.

We are using a single flat provider network with two available IPs (one for the DHCP server and one for the server itself).
Here is a paste of a bunch of the network resources (censored here and there just in case). http://paste.openstack.org/show/744513/
We could add an IP to the job to fix it, but I'd rather not do that needlessly.
On Thu, 2019-01-31 at 11:30 -0500, Michael Turek wrote:
Hello all,
Our ironic job has been broken and it seems to be due to a lack of IPs. We allocate two IPs to our job, one for the dhcp server, and one for the target node. This had been working for as long as the job has existed but recently (since about early December 2018), we've been broken.
The job is able to clean the node during devstack, successfully deploy to the node during the tempest run, and is successfully validated via ssh. The node then moves to clean failed with a network error [1], and the job subsequently fails. Sometime between the validation and attempting to clean, the neutron port associated with the ironic port is deleted and a new port comes into existence. Where I'm having trouble is finding out what this port is. Based on its MAC address, it's a virtual port, and its MAC is not the same as the ironic port.
We could add an IP to the job to fix it, but I'd rather not do that needlessly.
Any insight or advice would be appreciated here!
While working on the neutron events I noticed a pattern I thought was a bit strange. (Note: this was with neutron networking.)

Create nova baremetal instance:
1. The tenant VIF is created.
2. The provision port is created.
3. Provision port plugged (bound)
4. Provision port un-plugged (deleted)
5. Tenant port plugged (bound)

On nova delete of baremetal instance:
1. Tenant VIF is un-plugged (unbound)
2. Cleaning port created
3. Cleaning port plugged (bound)
4. Cleaning port un-plugged (deleted)
5. Tenant port deleted

I think step 5, deleting the tenant port, could happen right after step 1. But it looks like it isn't deleted until after cleaning is done.

If this is the case with flat networks as well, it could explain why you get the error on cleaning. The "tenant" port still exists, and there are no free IPs in the allocation pool to create a new port for cleaning.

-- Harald
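The hypothesis above can be illustrated with a toy model of a two-address allocation pool. This is only a sketch: it does not call neutron or ironic, the `AllocationPool` class, the `run_teardown` function, and the documentation-range addresses are all hypothetical, and it assumes the flat-network teardown follows the same ordering observed with neutron networking.

```python
# Toy model of a two-IP allocation pool during instance teardown.
# All names and addresses are illustrative; no OpenStack API is used.


class AllocationPool:
    """Hands out addresses to named ports until none are left."""

    def __init__(self, addresses):
        self._free = list(addresses)
        self._ports = {}

    def create_port(self, name):
        if not self._free:
            raise RuntimeError(f"no free IP for port {name!r}")
        self._ports[name] = self._free.pop(0)
        return self._ports[name]

    def delete_port(self, name):
        # Return the port's address to the pool.
        self._free.append(self._ports.pop(name))


def run_teardown(delete_tenant_first: bool) -> str:
    """Simulate the delete sequence with one IP for DHCP and one for the node."""
    pool = AllocationPool(["192.0.2.10", "192.0.2.11"])
    pool.create_port("dhcp")
    pool.create_port("tenant")  # pool is now exhausted
    try:
        if delete_tenant_first:
            pool.delete_port("tenant")
        pool.create_port("cleaning")
        pool.delete_port("cleaning")
        if not delete_tenant_first:
            pool.delete_port("tenant")
    except RuntimeError as exc:
        return f"clean failed: {exc}"
    return "clean ok"


if __name__ == "__main__":
    print(run_teardown(delete_tenant_first=False))  # observed ordering
    print(run_teardown(delete_tenant_first=True))   # tenant port freed first
```

With the observed ordering the cleaning port cannot get an address, because the tenant port still holds the second IP. If the tenant port were deleted before the cleaning port is created, two addresses would be enough.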
participants (3)
- Harald Jensås
- Julia Kreger
- Michael Turek