[neutron] DNS resolution delay after VM launch
Hi,

I'm trying to resolve an issue that happens about 30% of the time. The environment uses DVR and was deployed with Kolla Ansible on CentOS 7 using Stein (the latest Kolla Ansible installer). The physical servers have the latest CentOS 7 patches through November 30th, 2019.

The problem happens on new and existing projects, including newly created subnets, and so far it seems to be related only to VM launches. Increasing the delay between subnet creation and VM launch makes no difference.

When the problem occurs, the launched VM takes longer than normal to boot, which usually indicates a DNS issue or possibly a DHCP issue. After the OS boots (after 60 seconds or so), I can log in, but the login is delayed (another 20 seconds or so) by the DNS issue I'm diagnosing.

DHCP works fine: there are no issues with IP assignment, and the /etc/resolv.conf entries are set to the routers' two private IPs. There are no DHCP errors in the dmesg output.

Immediately after the subnet is created (192.168.99.0/24 in our test case), I can enter both qdhcp network namespaces where dnsmasq is running for the two distributed routers and resolve names instantly using "nslookup <name> 192.168.99.1" and "nslookup <name> 192.168.99.2". So dnsmasq is working properly.

A floating IP is assigned to this VM and is used for the following SSH sessions. After SSH'ing to the VM, I can ping 192.168.99.1 and 192.168.99.2, so it does not appear to be a general network issue (ICMP works), unless a firewall rule in iptables or OVS is blocking UDP/TCP port 53 for DNS requests. That would be odd, since the egress security group assigned to this VM is unrestricted. I can also ping external, Internet-accessible IPs from the VM, so the routers are working, and I can perform nslookups against external DNS servers without issue, such as "nslookup <name> 1.1.1.1".
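Since ICMP works but a rule affecting port 53 cannot be ruled out, one way to narrow this down from inside the VM is to probe UDP and TCP port 53 separately and time each reply. The following is only a minimal sketch using the Python standard library; the function names (build_query, probe_udp, probe_tcp) are hypothetical helpers, not part of any OpenStack or dnsmasq tooling, and the name looked up is a placeholder.

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS A-record query in RFC 1035 wire format."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def probe_udp(server, name, timeout=2.0, port=53):
    """Return reply latency in seconds, or None if no matching reply arrived."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and ICMP "port unreachable"
            return None
        elapsed = time.monotonic() - start
    # Only count a reply that echoes the query ID we sent.
    return elapsed if reply[:2] == query[:2] else None


def probe_tcp(server, name, timeout=2.0, port=53):
    """Return latency of a DNS-over-TCP exchange in seconds, or None on failure."""
    query = build_query(name)
    start = time.monotonic()
    try:
        with socket.create_connection((server, port), timeout=timeout) as sock:
            # DNS over TCP prefixes each message with a 2-byte length.
            sock.sendall(struct.pack("!H", len(query)) + query)
            reply = sock.recv(2)
    except OSError:
        return None
    return time.monotonic() - start if reply else None
```

Calling probe_udp("192.168.99.1", "example.com") and probe_tcp("192.168.99.1", "example.com") right after launch, and again against 1.1.1.1, would show whether one transport, both, or neither gets through to the internal resolver while the problem is present.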
If I direct nslookup to the internal DNS servers using "nslookup <name> 192.168.99.1", the lookup times out. We normally use CentOS images, but I tried an Ubuntu 19.04 image and the same problem occurs often.

Now, what is ODD: the problem goes away after about 3 minutes! DNS lookups against the internal DNS servers (the router IPs) start working perfectly. So something is taking a while to get configured, but only about 30% of the time.

If we specify --dns-nameserver options when creating the subnet, such as:

--dns-nameserver 1.1.1.1

the issue NEVER occurs. So it seems to be limited to DNS traffic between the VM and the dnsmasq service in the qdhcp namespace.

All traffic in this environment is switched (we are not routing VXLAN traffic, for example), so there can be no physical network interference at Layer 3 or higher. The problem has also occurred on multiple hosts in the environment (the VM has been launched on various hosts with the same issue).

Any ideas how to diagnose this? I'm assuming it has something to do with the iptables or OVS configuration, but if someone knows specifically what could cause this, such as a specific rule that is supposed to be created for DNS traffic to the internal routers, I could dive into these configurations.

Eric
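Because the problem clears on its own after roughly 3 minutes, pinning down the exact recovery moment may help match it against Neutron agent log entries on the compute host. Below is a minimal sketch, assuming Python 3 is available in the guest; dns_answers and seconds_until_answering are hypothetical names, and the interval and maximum wait are arbitrary defaults rather than recommended values.

```python
import socket
import struct
import time


def dns_answers(server, name, timeout=2.0):
    """Return True if `server` replies to one UDP A-record query within `timeout`."""
    header = struct.pack("!HHHHHH", 0x4242, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(part)]) + part.encode("ascii") for part in name.split(".")
    ) + b"\x00"
    query = header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # timeout or unreachable
            return False
    return reply[:2] == query[:2]  # reply must echo our query ID


def seconds_until_answering(check, interval=5.0, max_wait=300.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True.

    Returns the elapsed seconds at the first success, or None if
    max_wait passes without a successful check.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= max_wait:
            return None
        sleep(interval)
```

Running seconds_until_answering(lambda: dns_answers("192.168.99.1", "example.com")) as soon as the VM is reachable, and noting the wall-clock time when it returns, gives a timestamp to compare with the agent logs around the moment lookups begin to succeed.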
Eric K. Miller