Eric K. Miller emiller at genesishosting.com
Sun Dec 1 22:39:13 UTC 2019



I'm trying to resolve an issue that happens about 30% of the time.  DVR
is being used in this environment.  The environment is deployed with
Kolla Ansible on CentOS 7 using Stein (the latest Kolla Ansible
installer).  Physical servers have the latest CentOS 7 patches through
November 30th, 2019.


This seems to happen on new and existing projects, including newly
created subnets, and, so far, it seems to be only related to VM
launches.  Delaying the time between a subnet creation and the VM launch
makes no difference in the behavior.


When the problem occurs, after a VM is launched, it takes a
longer-than-normal time to boot.  This would usually indicate a DNS
issue or possibly a DHCP issue.


After the OS boots (after 60 seconds or so), I can log in, but the login
is delayed (another 20 seconds or so) due to the DNS issue I'm describing.


DHCP works fine - no issues with IP assignment, and the /etc/resolv.conf
entries are set to the routers' two private IPs.  No DHCP errors in the
dmesg output.


I can connect to both respective qdhcp network namespaces where dnsmasq
is running for the two distributed routers, immediately after the subnet
is created (in our test case, it is, and can resolve
names using "nslookup <name>" and "nslookup <name>" instantly.  So,
dnsmasq is working properly.


A floating IP is assigned to this VM, which is used for the following
SSH sessions.


After SSH'ing to the VM, I can ping the and
addresses, so it does not appear to be a network issue (ICMP works),
unless there is a firewall rule either in iptables or OVS that is
blocking UDP/TCP port 53 for DNS requests, which would be odd, since the
egress security group assigned to this VM is unrestricted.


I can also ping external IPs from the VM (Internet-accessible IPs), so
the routers are working.


I can also perform nslookups against external DNS servers without issue,
such as "nslookup <name>".  If I direct nslookup to the internal
DNS servers using "nslookup <name>", the lookup times out.
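For anyone trying to reproduce this, the timeout can be checked from inside the VM without relying on nslookup. The following is only a minimal sketch using the Python standard library: it sends a single hand-built A-record query over UDP to a given server and reports whether a reply arrived before the timeout. The function names, the 2-second timeout, and the query ID are illustrative assumptions, and the server IP and name passed in are whatever applies in your environment.

```python
import socket
import struct


def build_dns_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record (layout per RFC 1035)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels followed by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def probe_dns(server, name, timeout=2.0, port=53):
    """Send one UDP query; return 'answered', 'timeout', or 'error: ...'."""
    query = build_dns_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            data, _ = sock.recvfrom(512)
        except socket.timeout:
            # No reply at all: consistent with the packet being dropped
            return "timeout"
        except OSError as exc:
            return "error: %s" % exc
    # A genuine reply echoes the transaction ID from the query
    if data[:2] == query[:2]:
        return "answered"
    return "error: unexpected reply"
```

Running probe_dns against an external server and then against each router IP should mirror the nslookup results above: "answered" for the external server and "timeout" for the internal ones while the problem is present.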


We normally use CentOS images, but I tried an Ubuntu 19.04 image and the
same problem occurs often.


Now, what is ODD - the problem goes away after about 3 minutes!  DNS
lookups against the internal DNS servers (the router IPs) start working
perfectly.  So something is taking a while to get configured, but only
about 30% of the time.
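To put a number on the "about 3 minutes", the recovery time can be measured rather than estimated. The sketch below is a hypothetical helper, not something from this deployment: it polls a probe function at a fixed interval and returns how many seconds passed before the first success. The helper names, the 5-second interval, and the 10-minute cap are assumptions. The default probe uses the system resolver, which inside the VM points at the router IPs from /etc/resolv.conf.

```python
import socket
import time


def resolves(name):
    """True if the system resolver returns an address for name."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def seconds_until_working(probe, interval=5.0, max_wait=600.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True; return elapsed seconds or None."""
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= max_wait:
            # Gave up: the resolver never answered within max_wait
            return None
        sleep(interval)
```

Started right after the VM boots, for example as seconds_until_working(lambda: resolves("<name>")), this gives a per-launch figure that can be lined up against timestamps in the neutron agent logs. The name must be one that is known to resolve, since a nonexistent name also raises gaierror and would look like a failure.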


If, during the creation of the subnet, we specify --dns-nameserver
options, such as:


the issue NEVER occurs.  So it seems to be limited to DNS traffic
between the VM and the dnsmasq service in the qdhcp namespace.


All traffic in this environment is switched (we are not routing VXLAN
traffic, for example), so there can be no physical network interference
at Layers 3 or higher.


The problem has also occurred on multiple hosts in the environment (the
VM has been launched on various hosts with the same issue).


Any ideas how to diagnose this?  I'm assuming it has something to do
with the iptables or OVS configuration, but if someone knows
specifically what could cause this, such as a specific rule that is
supposed to be created for DNS traffic to the internal routers, I could
dive into these configurations.




