[neutron] Dadfailed of ipv6 metadata IP in qdhcp namespace and disappearing dhcp namespaces

Kamil Madáč kamil.madac at slovenskoit.sk
Sun Jan 2 15:51:45 UTC 2022


In our small cloud environment we started to see odd behavior over the last 2 months. DHCP namespaces started to disappear at random, which caused VMs to lose connectivity once their DHCP lease expired.
After investigating, I found the following issue/bug:

  1.  The IPv6 metadata address on the tap interface in some qdhcp-xxxx namespaces is stuck in the "dadfailed tentative" state (I do not know why yet):
root@cloud01:~# ip netns exec qdhcp-3094b264-829b-4381-9ca2-59b3a3fc1ea1 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2585: tap1797d9b1-e1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:77:64:0d brd ff:ff:ff:ff:ff:ff
    inet brd scope global tap1797d9b1-e1
       valid_lft forever preferred_lft forever
    inet brd scope global tap1797d9b1-e1
       valid_lft forever preferred_lft forever
    inet6 fe80::a9fe:a9fe/64 scope link dadfailed tentative
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe77:640d/64 scope link
       valid_lft forever preferred_lft forever
  2.  This prevents the DHCP agent from completing its sync_state function, so NetworkCache is not updated with the subnets of the affected neutron network.
  3.  When a VM is created on such a network, the agent does not detect any subnets (see point 2), so it concludes (in reload_allocations()) that no DHCP is needed and deletes the qdhcp-xxxx namespace. From that moment neither DHCP nor metadata works on that network, and after 24h we see connectivity issues.
  4.  Restarting the DHCP agent recreates the missing qdhcp-xxxx namespaces, but NetworkCache in the DHCP agent is empty again, so creating a VM deletes the qdhcp-xxxx namespace again 🙁
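
To see how many namespaces are affected, something like the following could be used. It is only a rough sketch, not part of neutron: the helper names find_dadfailed and scan_namespaces are made up for illustration, and it assumes the output format of `ip -6 addr show` shown above and that it runs as root on the host with the qdhcp namespaces.

```python
import subprocess


def find_dadfailed(ip_addr_output):
    """Return (interface, address) pairs whose inet6 line has the dadfailed flag."""
    failed = []
    iface = None
    for line in ip_addr_output.splitlines():
        if line and not line[0].isspace():
            # Interface header, e.g. "2585: tap1797d9b1-e1: <BROADCAST,...> mtu 1500"
            iface = line.split(":")[1].strip().split("@")[0]
            continue
        fields = line.split()
        # Address line, e.g. "inet6 fe80::a9fe:a9fe/64 scope link dadfailed tentative"
        if len(fields) >= 2 and fields[0] == "inet6" and "dadfailed" in fields[2:]:
            failed.append((iface, fields[1]))
    return failed


def scan_namespaces():
    """Check every qdhcp-* namespace and return {namespace: [(iface, addr), ...]}."""
    listing = subprocess.run(
        ["ip", "netns", "list"], capture_output=True, text=True, check=True
    ).stdout
    affected = {}
    for entry in listing.splitlines():
        if not entry.strip():
            continue
        namespace = entry.split()[0]  # drop a trailing "(id: N)" if present
        if not namespace.startswith("qdhcp-"):
            continue
        output = subprocess.run(
            ["ip", "netns", "exec", namespace, "ip", "-6", "addr", "show"],
            capture_output=True, text=True, check=True,
        ).stdout
        failed = find_dadfailed(output)
        if failed:
            affected[namespace] = failed
    return affected
```

Calling scan_namespaces() on the network node returns only the namespaces that have at least one dadfailed address, which gives a list of candidates for the workaround below.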

The workaround is to remove the DHCP agent from the network and add it again. Interestingly, I sometimes need to do this several times, because in a few cases the tap interface in the new qdhcp namespace ends up in the dadfailed tentative state again. After a year in production, 20 of our 60 networks are in this state.
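
For reference, the remove/add cycle could be scripted roughly as below. This is a sketch under assumptions: it uses the `openstack network agent remove network --dhcp` and `openstack network agent add network --dhcp` commands from python-openstackclient, it expects credentials to be loaded in the environment already, and the function names and the agent and network IDs are placeholders.

```python
import subprocess


def readd_commands(agent_id, network_id):
    """Build the two CLI calls that detach and re-attach a DHCP agent."""
    base = ["openstack", "network", "agent"]
    return [
        base + ["remove", "network", "--dhcp", agent_id, network_id],
        base + ["add", "network", "--dhcp", agent_id, network_id],
    ]


def readd_dhcp_agent(agent_id, network_id, run=subprocess.run):
    """Run remove then add; `run` can be swapped out for a dry run."""
    for command in readd_commands(agent_id, network_id):
        run(command, check=True)
```

Since the new tap interface can land in dadfailed again, the namespace should be re-checked after each cycle and the cycle repeated if needed.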

We are using a kolla-ansible deployment on Ubuntu 20.04, kernel 5.4.0-65-generic. The OpenStack release is Victoria and neutron is at version 17.2.2.dev70.

Is this a bug in neutron, or a misconfiguration of the OS on our side?

I'm locally testing a patch which disables IPv6 DAD in the qdhcp-xxxx namespace (net.ipv6.conf.default.accept_dad = 0), but I'm not sure whether it is a good solution with respect to other neutron features.
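
To make the idea concrete, the setting can be tried by hand in one namespace. The snippet below is only an illustration and not the patch itself: accept_dad_command is a made-up helper that builds the `sysctl -w` call to run inside a namespace. Per the kernel documentation, accept_dad takes 0 (DAD disabled), 1 (DAD enabled, the default) or 2 (DAD enabled, and IPv6 disabled on the interface if a duplicate MAC-based link-local address is found).

```python
def accept_dad_command(namespace, value=0, interface="default"):
    """Build the argv that sets accept_dad for one interface inside a namespace."""
    if value not in (0, 1, 2):
        raise ValueError("accept_dad must be 0, 1 or 2")
    key = f"net.ipv6.conf.{interface}.accept_dad"
    return ["ip", "netns", "exec", namespace, "sysctl", "-w", f"{key}={value}"]
```

As far as I understand, the "default" key only applies to interfaces created after it is set, so for an already existing tap device the interface name would have to be passed instead.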

Kamil Madáč
Slovensko IT a.s.

