Hello Kamil,

We also experienced this issue after upgrading to Victoria, which introduced availability of the metadata service over the IPv6 link-local address fe80::a9fe:a9fe. In bug 1953165 I mentioned a workaround. Below is the script I used, which should be followed by a restart of neutron-dhcp-agent.

#!/bin/bash
# Remove the stuck IPv6 metadata address from every qdhcp namespace
# that reports an address in the "dadfailed" state.
for ns in $(ip netns | grep -o 'qdhcp[^ ]*'); do
    if sudo ip netns exec "$ns" ip a | grep -q dadfailed; then
        tap=$(sudo ip netns exec "$ns" ip link | grep -o 'tap[^:]*')
        echo "Cleaning up IPv6 from $tap on $ns"
        sudo ip netns exec "$ns" ip addr del fe80::a9fe:a9fe/64 dev "$tap"
    fi
done

On Mon, 3 Jan 2022 at 10:02, Kamil Madáč <kamil.madac@slovenskoit.sk> wrote:
Hi Brian,
Thank you very much for pointing to those bugs. It is exactly what we are experiencing in our deployment. I will follow up in those bugs then.

Kamil

________________________________
From: Brian Haley <haleyb.dev@gmail.com>
Sent: Monday, January 3, 2022 2:35 AM
To: Kamil Madáč <kamil.madac@slovenskoit.sk>; openstack-discuss <openstack-discuss@lists.openstack.org>
Subject: Re: [neutron] Dadfailed of ipv6 metadata IP in qdhcp namespace and disappearing dhcp namespaces
Hi,
On 1/2/22 10:51 AM, Kamil Madáč wrote:
Hello,
In our small cloud environment, we have been seeing strange behavior over the last 2 months. DHCP namespaces started to disappear randomly, which caused VMs to lose connectivity once their DHCP lease expired. After investigating, I found the following issue/bug:
1. The IPv6 metadata address on the tap interface in some qdhcp-xxxx namespaces is stuck in the "dadfailed tentative" state (I do not know why yet):
This issue was reported about a month ago:
https://bugs.launchpad.net/neutron/+bug/1953165
And Bence marked it a duplicate of:
https://bugs.launchpad.net/neutron/+bug/1930414
Seems to be a bug in a flow based on the title - "Traffic leaked from dhcp port before vlan tag is applied".
I would follow up in that second bug.
Thanks,
-Brian
root@cloud01:~# ip netns exec qdhcp-3094b264-829b-4381-9ca2-59b3a3fc1ea1 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2585: tap1797d9b1-e1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:77:64:0d brd ff:ff:ff:ff:ff:ff
    inet 169.254.169.254/32 brd 169.254.169.254 scope global tap1797d9b1-e1
       valid_lft forever preferred_lft forever
    inet 192.168.0.2/24 brd 192.168.0.255 scope global tap1797d9b1-e1
       valid_lft forever preferred_lft forever
    inet6 fe80::a9fe:a9fe/64 scope link dadfailed tentative
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe77:640d/64 scope link
       valid_lft forever preferred_lft forever

2. This prevented the DHCP agent from finishing the sync_state function, so NetworkCache was not updated with the subnets of the affected neutron network.
3. When a VM is created on such a network, the agent does not detect any subnets (see point 2), so it concludes (in reload_allocations()) that no DHCP is needed and deletes the qdhcp-xxxx namespace. From that moment neither DHCP nor metadata work on that network, and after 24h we see connectivity issues.
4. Restarting the DHCP agent recreates the missing qdhcp-xxxx namespaces, but NetworkCache in the DHCP agent is again empty, so creating a VM deletes the qdhcp-xxxx namespace again 🙁
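
To see which namespaces are affected, something along the following lines may help. It is only a minimal sketch: the function name dadfailed_addrs and the "scan" argument are made up for illustration, and it assumes the DHCP namespaces are named qdhcp-<network-id> and that the `ip -6 addr show` output looks like the listing quoted in this thread.

```shell
#!/bin/bash
# Sketch only: dadfailed_addrs is an illustrative helper, not part of neutron.
# Reads `ip -6 addr show` output on stdin and prints every IPv6 address
# whose line carries the "dadfailed" flag.
dadfailed_addrs() {
    awk '$1 == "inet6" && /dadfailed/ { print $2 }'
}

# Run with the (hypothetical) argument "scan" to check every qdhcp namespace.
# Entering a namespace needs root, hence sudo.
if [ "${1:-}" = "scan" ]; then
    for ns in $(ip netns list | grep -o 'qdhcp-[^ ]*'); do
        addrs=$(sudo ip netns exec "$ns" ip -6 addr show | dadfailed_addrs)
        if [ -n "$addrs" ]; then
            echo "$ns: $addrs"
        fi
    done
fi
```

The parsing is kept separate from the namespace loop so it can be tried on saved `ip a` output without touching a production host.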
The workaround is to remove the DHCP agent from that network and add it again. Interestingly, I sometimes need to do it multiple times, because in a few cases the tap interface in the new qdhcp namespace ends up in the dadfailed tentative state again. After a year in production, 20 of our 60 networks are in this state.
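
The remove-and-add workaround could be scripted roughly as follows. This is a sketch under assumptions: the helper names run and reschedule_dhcp and the DRY_RUN variable are invented for illustration, the agent and network IDs are placeholders, and it relies on the `openstack network agent remove network` / `add network` commands with the --dhcp flag from python-openstackclient. By default it only prints the commands.

```shell
#!/bin/bash
# Sketch only: run, reschedule_dhcp and DRY_RUN are illustrative names.

# Print the command when DRY_RUN=1 (the default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Remove one network from one DHCP agent, then schedule it there again.
# $1 = DHCP agent ID (placeholder), $2 = network ID (placeholder)
reschedule_dhcp() {
    local agent_id=$1 network_id=$2
    run openstack network agent remove network --dhcp "$agent_id" "$network_id"
    run openstack network agent add network --dhcp "$agent_id" "$network_id"
}
```

Calling `reschedule_dhcp <agent-id> <network-id>` prints the two commands for review; setting DRY_RUN=0 would execute them. Since the address can end up in dadfailed again, the namespace should be re-checked afterwards.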
We are using a kolla-ansible deployment on Ubuntu 20.04, kernel 5.4.0-65-generic. The OpenStack version is Victoria and neutron is at version 17.2.2.dev70.
Is that a bug in neutron, or is it a misconfiguration of the OS on our side?
I'm locally testing a patch which disables IPv6 DAD in the qdhcp-xxxx namespace (net.ipv6.conf.default.accept_dad = 0), but I'm not sure whether it is a good solution with regard to other neutron features.
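
For reference, the Linux kernel documents three values for accept_dad: 0 disables DAD, 1 enables it (the default), and 2 enables it and additionally disables IPv6 on the interface if a duplicate MAC-based link-local address is found. The sketch below reads the current value inside a namespace and describes it; the function name describe_accept_dad and the way the namespace is passed as the first argument are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch only: describe_accept_dad is an illustrative helper.
# Maps an accept_dad value to its meaning per the kernel's ip-sysctl docs.
describe_accept_dad() {
    case "$1" in
        0) echo "DAD disabled" ;;
        1) echo "DAD enabled" ;;
        2) echo "DAD enabled; IPv6 disabled if a MAC-based duplicate link-local address is found" ;;
        *) echo "unknown value: $1" ;;
    esac
}

# If a namespace name is given as the first argument, read its setting.
# `sysctl -n` prints only the value; entering the namespace needs root.
if [ -n "${1:-}" ]; then
    value=$(sudo ip netns exec "$1" sysctl -n net.ipv6.conf.default.accept_dad)
    echo "$1: accept_dad=$value ($(describe_accept_dad "$value"))"
fi
```

Note that the "default" key only affects interfaces created after it is set; an existing tap device has its own net.ipv6.conf.<interface>.accept_dad entry.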
Kamil Madáč /Slovensko IT a.s./