[nova] Did something change about interface detach/attach from Yoga to Caracal?
I'm not sure why it was originally written, but I have an ancient system test that:
- creates a VM attached to one network
- finds the VM's port and calls interface_detach for it
- checks that the VM is now unpingable
- calls interface_attach with the relevant network_id
- hard reboots the VM
- checks that the VM is now pingable again.

With Yoga, the test passes, and at the end of the test the VM has one NIC, eth0, with an IP address.

With Caracal, the test also passes, but at the end of the test the VM has two NICs:
- eth0 with no IP address
- eth1 with an IP address.

Diffing the Nova code, it looks like interface_detach and interface_attach might actually have been no-ops in Yoga - although possibly only for the "vdpa" vnic-type, and I'm unsure whether that vnic-type would be relevant to my test setup.

Any ideas about this?

Many thanks - Nell
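For reference, the detach/attach/reboot sequence above might look roughly like this with openstacksdk. This is a sketch under assumptions: the original test's client library is unknown, and while these proxy method names exist in openstacksdk's compute proxy, the exact signatures should be verified against your SDK version.

```python
# Sketch of the test flow, assuming an openstacksdk Connection object.
# The ping checks are elided; only the Nova-side calls are shown.

def detach_attach_cycle(conn, server, network_id):
    """Detach the server's first interface, reattach on network_id,
    then hard-reboot, mirroring the test steps described above."""
    # Find the VM's (only) interface/port.
    iface = next(iter(conn.compute.server_interfaces(server)))
    # Detach it; the test then checks the VM is unpingable.
    conn.compute.delete_server_interface(iface, server)
    # Reattach on the same network, then hard reboot; the test then
    # checks the VM is pingable again.
    conn.compute.create_server_interface(server, net_id=network_id)
    conn.compute.reboot_server(server, "HARD")
```

As discussed later in the thread, the crucial missing piece in a naive version of this cycle is waiting between the detach and the attach.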
Hi,

I don't have this issue in Antelope, which is the current version of our production cloud, and I can't reproduce it in my test cloud on Caracal either; for me it works as expected. But this week I saw something at least similar to what you describe. Due to a bug in the dashboard I was also testing detaching interfaces from instances. From neutron's perspective the detach seemed to have worked, but there was still some kind of mixup, leading to an instance having two NICs (from neutron's perspective). It turned out to be a nova-compute issue. We had some network maintenance going on, and although 'openstack network agent list' and 'openstack compute service list' reported everything as up, there were still error messages in the logs about missing responses. Basically, I had to restart all nova-compute and neutron services to get out of that situation.

So I'd recommend checking the "real" health status of all agents. It could also be related to udev within the instance, although in that case I wouldn't expect two NICs, but rather eth1 instead of eth0. And where exactly do you see that the VM has two NICs? In the nova interface-list (i.e. from openstack's perspective) or within the VM? If it's the latter, I'd expect some udev rule or similar, i.e. something related to the VM's image. You can check /etc/udev/rules.d/70-persistent-net.rules (it may be a different path/file) for the current status.

Quoting Nell Jerram <nell@tigera.io>:
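The suggestion above to check the "real" health of agents can be partly automated by filtering the agent list rather than eyeballing it. This is a sketch: the 'alive' and 'admin_state_up' field names match what 'openstack network agent list -f json' reports, but verify them against your client's actual output, and note (as above) that an agent reported as up can still have RPC problems visible only in the logs.

```python
def unhealthy_agents(agents):
    """Return agents that are not both admin-up and alive.

    'agents' is any iterable of mappings with 'alive' and
    'admin_state_up' keys, e.g. parsed from
    'openstack network agent list -f json'."""
    return [a for a in agents
            if not (a.get("alive") and a.get("admin_state_up"))]
```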
Many thanks Eugen for your interest and reply on this issue. I did manage to 'resolve' this sufficiently by improving the relevant test code, and in case it is of any use to you, here is my commit message about that:

==============
The `test_detach_interface` test:
- creates a VM attached to one network
- finds the VM's port and calls `interface_detach` for it
- checks that the VM is now unpingable
- calls `interface_attach` with the relevant `network_id`
- hard reboots the VM
- checks that the VM is now pingable again.

Failure symptoms:

1. `test_detach_interface` fails in its last step with

   AssertionError: ['ana33z cannot ping 616fuw (10.28.0.131)', 'gmwmc8 cannot ping 616fuw (10.28.0.131)'] is not false : Some failures: ['ana33z cannot ping 616fuw (10.28.0.131)', 'gmwmc8 cannot ping 616fuw (10.28.0.131)']

   where 616fuw is the VM that had an interface detached and reattached and then got rebooted.

2. Following that, `test_ifdown_interface` fails with the same assertion, on its last line of code, where it is supposed to have full connectivity again.

3. `nova-compute.log` for around the time of the `test_detach_interface` test shows a `DeviceRemovedFailed` event.

4. `ip l` and `ip a` output from the beginning of `test_ifdown_interface` shows that the VM has two NICs: eth0 with no IPs and eth1 with an IP address, whereas at this point the VM is expected to have only one NIC, eth0, with an IP address.

The minimal fixes, both in `test_detach_interface`, are:

1. To make sure that the detached port has really gone before re-attaching the VM to the network. Not doing this seems to allow the detach and attach operations to overlap with each other, resulting in the VM having two NICs instead of just one. That directly breaks `test_ifdown_interface`, because that test assumes that eth0 is the active NIC.

2. To make sure that the VM has become active again after its reboot, before testing for connectivity.

(Weirdly, an alternative to (1) seems to be running `watch neutron port-list` in parallel with the test. This confused me for quite a while, because I wouldn't expect `neutron port-list` to modify any Neutron state! But it's highly reproducible that with only (2), plus `watch neutron port-list` running in parallel, the tests reliably pass; and that with only (2), and without any `port-list`, `test_ifdown_interface` reliably fails.)
==============

In summary, in my case I don't think it was related to agent health, but rather to the detach operation running a bit slower than it did in previous releases, and hence being more likely to overlap with a subsequent attach operation, given that my test code was not careful enough about this.

Best wishes - Nell

On Sun, Apr 27, 2025 at 9:50 AM Eugen Block <eblock@nde.ag> wrote:
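Both of the minimal fixes described are "wait until a condition holds" steps, so a generic polling helper covers them. The helper below is a sketch; the commented-out lambdas show hypothetical openstacksdk calls for the two waits (the original test's client and attribute names are assumptions).

```python
import time

def wait_for(predicate, timeout=60, interval=2, desc="condition"):
    """Poll predicate() until it returns a truthy value; raise
    TimeoutError after 'timeout' seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    raise TimeoutError(f"timed out after {timeout}s waiting for {desc}")

# Fix (1): wait until the detached port is really gone before re-attaching
# (hypothetical openstacksdk calls, for illustration only):
# wait_for(lambda: port_id not in
#          {i.port_id for i in conn.compute.server_interfaces(server)},
#          desc="detached port to disappear")
#
# Fix (2): wait until the VM is ACTIVE again after the hard reboot:
# wait_for(lambda: conn.compute.get_server(server.id).status == "ACTIVE",
#          desc="server to become ACTIVE")
```

Polling like this serializes the detach and attach operations, which is exactly what prevents the two-NIC overlap described above.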
Thanks for the details. Just FYI, the neutron CLI commands are deprecated:

root@controller01:~# neutron port-list
neutron CLI is deprecated and will be removed in the Z cycle. Use openstack CLI instead.

But it's interesting that listing ports with the openstack CLI takes almost twice as long as with the neutron CLI (this cloud has 641 ports):

root@controller01:~# time neutron port-list
...
real    0m2,460s
user    0m1,421s
sys     0m0,143s

root@controller01:~# openstack port list --timing
...
+--------------------------------------------------+----------+
| URL                                              | Seconds  |
+--------------------------------------------------+----------+
| GET http://controller:5000/v3                    | 0.010909 |
| POST http://controller:5000/v3/auth/tokens       | 0.034362 |
| GET http://controller:9696/v2.0/ports?fields=... | 0.671336 |
| Total                                            | 0.716607 |
+--------------------------------------------------+----------+

real    0m4,235s
user    0m3,208s
sys     0m0,306s

Quoting Nell Jerram <nell@tigera.io>:
participants (2)
- Eugen Block
- Nell Jerram