[neutron] Dadfailed of ipv6 metadata IP in qdhcp namespace and disappearing dhcp namespaces
Pierre Riteau
pierre at stackhpc.com
Thu Jan 6 09:55:54 UTC 2022
Hello Kamil,
We also experienced this issue after upgrading to Victoria, which
introduced availability of the metadata service over the IPv6
link-local address fe80::a9fe:a9fe.
In bug 1953165 I mentioned a workaround. Below is the script I used,
which should be followed by a restart of neutron-dhcp-agent.
#!/bin/bash
# For each qdhcp namespace with an address stuck in "dadfailed", remove the
# IPv6 metadata address from its tap interface.
for ns in $(ip netns | grep -o 'qdhcp[^ ]*'); do
    if sudo ip netns exec "$ns" ip a | grep -q dadfailed; then
        tap=$(sudo ip netns exec "$ns" ip link | grep -o 'tap[^:]*')
        echo "Cleaning up IPv6 from $tap on $ns"
        sudo ip netns exec "$ns" ip addr del fe80::a9fe:a9fe/64 dev "$tap"
    fi
done
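To see which namespaces are affected before deleting anything, a read-only check along these lines may help. This is only a sketch and not part of the workaround from the bug report: the function names list_dadfailed and scan_qdhcp_namespaces are made up for illustration, and the parsing assumes the usual `ip addr` output layout.

```shell
#!/bin/bash
# Read `ip addr` output on stdin and print "<interface> <address>" for
# every IPv6 address flagged dadfailed.
list_dadfailed() {
    awk '
        # Interface header line, e.g. "2585: tap1797d9b1-e1: <BROADCAST,...>"
        /^[0-9]+: / { iface = $2; sub(/:$/, "", iface); sub(/@.*/, "", iface) }
        # Address line, e.g. "inet6 fe80::a9fe:a9fe/64 scope link dadfailed tentative"
        $1 == "inet6" && /dadfailed/ { print iface, $2 }
    '
}

# Report "<namespace> <interface> <address>" for all qdhcp namespaces.
# Nothing is modified.
scan_qdhcp_namespaces() {
    local ns
    for ns in $(ip netns | grep -o 'qdhcp[^ ]*'); do
        sudo ip netns exec "$ns" ip -6 addr show | list_dadfailed | sed "s/^/$ns /"
    done
}
```

Calling scan_qdhcp_namespaces on a network node prints one line per stuck address; no output means no namespace currently reports dadfailed.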
On Mon, 3 Jan 2022 at 10:02, Kamil Madáč <kamil.madac at slovenskoit.sk> wrote:
>
> Hi Brian,
>
> thank you very much for pointing to those bugs. It is exactly what we are experiencing in our deployment. I will follow-up in those bugs then.
>
> Kamil
> ________________________________
> From: Brian Haley <haleyb.dev at gmail.com>
> Sent: Monday, January 3, 2022 2:35 AM
> To: Kamil Madáč <kamil.madac at slovenskoit.sk>; openstack-discuss <openstack-discuss at lists.openstack.org>
> Subject: Re: [neutron] Dadfailed of ipv6 metadata IP in qdhcp namespace and disappearing dhcp namespaces
>
> Hi,
>
> On 1/2/22 10:51 AM, Kamil Madáč wrote:
> > Hello,
> >
> > In our small cloud environment, we started to see weird behavior during
> > the last 2 months. DHCP namespaces started to disappear randomly, which
> > caused VMs to lose connectivity once their DHCP lease expired.
> > After investigating, I found the following issue/bug:
> >
> > 1. The IPv6 metadata address of the tap interface in some qdhcp-xxxx
> > namespaces is stuck in the "dadfailed tentative" state (I do not know why yet)
>
> This issue was reported about a month ago:
>
> https://bugs.launchpad.net/neutron/+bug/1953165
>
> And Bence marked it a duplicate of:
>
> https://bugs.launchpad.net/neutron/+bug/1930414
>
> Seems to be a bug in a flow based on the title - "Traffic leaked from
> dhcp port before vlan tag is applied".
>
> I would follow-up in that second bug.
>
> Thanks,
>
> -Brian
>
> > root@cloud01:~# ip netns exec qdhcp-3094b264-829b-4381-9ca2-59b3a3fc1ea1 ip a
> > 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
> >     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> >     inet 127.0.0.1/8 scope host lo
> >        valid_lft forever preferred_lft forever
> >     inet6 ::1/128 scope host
> >        valid_lft forever preferred_lft forever
> > 2585: tap1797d9b1-e1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
> >     link/ether fa:16:3e:77:64:0d brd ff:ff:ff:ff:ff:ff
> >     inet 169.254.169.254/32 brd 169.254.169.254 scope global tap1797d9b1-e1
> >        valid_lft forever preferred_lft forever
> >     inet 192.168.0.2/24 brd 192.168.0.255 scope global tap1797d9b1-e1
> >        valid_lft forever preferred_lft forever
> >     inet6 fe80::a9fe:a9fe/64 scope link dadfailed tentative
> >        valid_lft forever preferred_lft forever
> >     inet6 fe80::f816:3eff:fe77:640d/64 scope link
> >        valid_lft forever preferred_lft forever
> >
> > 2. This prevented the DHCP agent from finishing the sync_state function, and
> > NetworkCache was not updated with the subnets of that neutron network.
> > 3. During creation of a VM attached to such a network, the agent does not
> > detect any subnets (see point 2), so it concludes
> > (reload_allocations()) that no DHCP is needed and deletes the
> > qdhcp-xxxx namespace. From that moment neither DHCP nor metadata work on
> > that network, and after 24h we see connectivity issues.
> > 4. Restarting the DHCP agent recreates the missing qdhcp-xxxx namespaces, but
> > NetworkCache in the DHCP agent is empty again, so creating a VM
> > deletes the qdhcp-xxxx namespace again 🙁
> >
> > The workaround is to remove the DHCP agent from that network and add it again.
> > Interestingly, I sometimes need to do it multiple times, because in a few
> > cases the tap interface in the new qdhcp namespace ends up in the dadfailed
> > tentative state again. After a year in production, 20 of our 60 networks are in this state.
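The remove-and-add workaround described above could be scripted roughly as follows. This is a sketch under assumptions: readd_dhcp_agent is a hypothetical helper name, the agent and network IDs are supplied by the caller, and the `openstack network agent remove network` / `add network` commands from python-openstackclient are assumed to be available. By default the commands are only printed; set DRY_RUN=0 to execute them.

```shell
#!/bin/bash
# Remove a network from a DHCP agent and schedule it on the same agent again.
# Usage: readd_dhcp_agent <agent-id> <network-id>
# With DRY_RUN unset or 1 the commands are printed instead of executed.
readd_dhcp_agent() {
    local agent_id=$1 network_id=$2
    local run=echo
    # An empty $run makes the lines below execute the real command.
    [ "${DRY_RUN:-1}" = "0" ] && run=
    # Stop if the removal fails, so the network is not added twice.
    $run openstack network agent remove network --dhcp "$agent_id" "$network_id" || return 1
    $run openstack network agent add network --dhcp "$agent_id" "$network_id"
}
```

As noted in the quoted message, the new tap interface can end up in dadfailed again, so the namespace should be checked afterwards and the step repeated if necessary. The agent ID can be looked up with `openstack network agent list`.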
> >
> > We are using a kolla-ansible deployment on Ubuntu 20.04, kernel
> > 5.4.0-65-generic. The OpenStack version is Victoria and neutron is at
> > version 17.2.2.dev70.
> >
> >
> > Is that a bug in neutron, or is it a misconfiguration of the OS on our side?
> >
> > I'm locally testing a patch which disables IPv6 DAD in the qdhcp-xxxx
> > namespace (net.ipv6.conf.default.accept_dad = 0), but I'm not sure it is a
> > good solution when it comes to other neutron features.
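For reference, an experiment along the lines of the patch mentioned above might look like the sketch below. It is not a recommended fix and its effect on other neutron features is untested here. disable_dad_in_ns is a hypothetical helper name. In the kernel's accept_dad setting, 0 disables duplicate address detection and 1 (the default) enables it, and the "default" entry only affects interfaces created afterwards, so an existing tap device needs its own per-interface setting. By default the commands are only printed; set DRY_RUN=0 to execute them.

```shell
#!/bin/bash
# Print (or, with DRY_RUN=0, run) the sysctl commands that turn off IPv6 DAD
# inside one qdhcp namespace.
# Usage: disable_dad_in_ns <namespace> [<existing-tap-device>]
disable_dad_in_ns() {
    local ns=$1 tap=${2:-}
    local run="echo sudo"
    [ "${DRY_RUN:-1}" = "0" ] && run="sudo"
    # "default" applies only to interfaces created after this point.
    $run ip netns exec "$ns" sysctl -w net.ipv6.conf.default.accept_dad=0
    # An already existing tap device keeps its own value unless set explicitly.
    if [ -n "$tap" ]; then
        $run ip netns exec "$ns" sysctl -w "net.ipv6.conf.$tap.accept_dad=0"
    fi
}
```

For example, `disable_dad_in_ns qdhcp-3094b264-829b-4381-9ca2-59b3a3fc1ea1 tap1797d9b1-e1` prints the two commands for the namespace shown earlier in the thread.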
> >
> >
> > Kamil Madáč
> > /Slovensko IT a.s./
> >