[openstack-dev] dhcp 'Address already in use' errors when trying to start a dnsmasq

Ihar Hrachyshka ihrachys at redhat.com
Tue Sep 27 18:22:14 UTC 2016


Hi all,

We have started getting ‘Address already in use’ errors when trying to start
dnsmasq after the previous instance of the process is killed with kill -9.
Armando spotted it today in the logs for https://review.openstack.org/#/c/377626/
but according to logstash it is an error we have seen before (the earliest
occurrence I see is 9/20), e.g.:

http://logs.openstack.org/26/377626/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz

Assuming I understand the flow of the failure, it runs as follows:

- sync_state starts dnsmasq per network;
- after agent lock is freed, some other notification event  
(port_update/subnet_update/...) triggers restart for one of the processes;
- the restart is done not via reload_allocations (-SIGHUP) but through
restart/disable (kill -9);
- once the old dnsmasq is killed with -9, we attempt to start a new process
with newly generated config files and fail with: “dnsmasq: failed to create
listening socket for 10.1.15.242: Address already in use”;
- surprisingly, after several failed attempts, the process starts
successfully some seconds later and runs fine.
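
To make the symptom concrete, here is a minimal sketch of the pattern in
the last two steps: a bind that fails with EADDRINUSE while something else
still holds the address, and then succeeds on a later retry. It uses a
plain UDP socket on loopback as a stand-in for the old process. The helper
names (try_bind, bind_with_retries) and the timings are made up for
illustration and are not taken from the agent or from dnsmasq.

```python
import errno
import socket
import threading
import time


def try_bind(address, port):
    """Try to bind a UDP socket; return None on success, else the errno."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((address, port))
        return None
    except OSError as exc:
        return exc.errno
    finally:
        sock.close()


def bind_with_retries(address, port, attempts=20, delay=0.1):
    """Retry the bind; return how many attempts failed before it worked."""
    for failed in range(attempts):
        if try_bind(address, port) is None:
            return failed
        time.sleep(delay)
    raise TimeoutError("address still in use after %d attempts" % attempts)


if __name__ == "__main__":
    # Hypothetical stand-in for the old process: it holds the address for
    # a short while and then lets go of it.
    holder = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    holder.bind(("127.0.0.1", 0))
    port = holder.getsockname()[1]
    threading.Timer(0.3, holder.close).start()

    failed = bind_with_retries("127.0.0.1", port)
    print("bound after %d failed attempt(s)" % failed)
```

The sketch only shows that this pattern appears whenever the previous
holder of the address goes away late; it says nothing about why that
would happen in the gate.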

It looks like once we kill the process with -9, it may hold on to the
socket resource for some time, which may clash with the new process we try
to spawn. It’s a bit weird because dnsmasq should have set SO_REUSEADDR on
the socket, so a new process should have started just fine.
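
For reference, here is a small sketch of what SO_REUSEADDR is generally
expected to buy on Linux, using a loopback TCP listener: after the server
side of a connection is closed first, a rebind of the same port fails
without the option and succeeds with it. This is generic socket behaviour
under Linux assumptions, not a statement about how dnsmasq sets up its
sockets. All helper names are hypothetical.

```python
import errno
import socket


def open_listener(port, reuse_addr):
    """Bind and listen on a loopback TCP port, optionally with SO_REUSEADDR."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuse_addr:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("127.0.0.1", port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def leave_lingering_connection():
    """Serve one connection, close the server side first, return the port."""
    listener = open_listener(0, reuse_addr=True)
    port = listener.getsockname()[1]
    client = socket.create_connection(("127.0.0.1", port))
    conn, _ = listener.accept()
    conn.close()      # server closes first, so its end of the connection lingers
    client.close()
    listener.close()
    return port


def rebind_errno(port, reuse_addr):
    """Return None if a new listener can take the port, else the errno."""
    try:
        open_listener(port, reuse_addr).close()
        return None
    except OSError as exc:
        return exc.errno


if __name__ == "__main__":
    port = leave_lingering_connection()
    print("without SO_REUSEADDR:", rebind_errno(port, reuse_addr=False))
    print("with SO_REUSEADDR:   ", rebind_errno(port, reuse_addr=True))
```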

Lately, we landed several patches that touched the reload logic the DHCP
agent runs on notifications. The suspicious ones in this context are:

- https://review.openstack.org/#/c/372595/ - note it requests ‘disable’
(-9) where it was using ‘reload_allocations’ (-SIGHUP) before, and it also
does not unplug the port on lease release (maybe after we rip out the
device, the address clash with the old dnsmasq state is gone even though
the ‘new’ port will use the same address?).
- https://review.openstack.org/#/c/372236/6 - we were requesting
reload_allocations in some cases before, and now we put the network into
the resync queue.

There were other related changes lately; you can check the history of
Kevin’s changes for the branch, which should capture most of them.

I wonder whether we hit some long-standing restart issue with dnsmasq here
that was just never triggered before because we were not calling kill -9 as
eagerly as we do now.

Note: Jakub Libosvar validated that ‘kill -9 && dnsmasq’ in a loop does NOT
result in the failure we see in gate logs.
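
For anyone who wants to try a similar check without dnsmasq, below is a
rough sketch of a kill -9 and rebind cycle. It is not Jakub's test: the
child is a hypothetical Python stand-in that just binds a loopback UDP
port, and CHILD and kill_and_rebind are names invented here. It assumes a
POSIX system where SIGKILL is available.

```python
import signal
import socket
import subprocess
import sys

# Hypothetical stand-in for the daemon: bind a UDP port, report it, sleep.
CHILD = """
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.bind(("127.0.0.1", 0))
print(s.getsockname()[1], flush=True)
time.sleep(60)
"""


def kill_and_rebind():
    """SIGKILL a child that holds a port, then try to bind the same port."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    try:
        port = int(proc.stdout.readline())
        proc.send_signal(signal.SIGKILL)  # the equivalent of kill -9
        proc.wait()                       # wait until the child is reaped
    finally:
        if proc.poll() is None:
            proc.kill()
            proc.wait()
        proc.stdout.close()

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind(("127.0.0.1", port))
        return True
    except OSError:
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    results = [kill_and_rebind() for _ in range(10)]
    print("rebind succeeded in %d of %d rounds" % (sum(results), len(results)))
```

Note that the sketch waits for the killed child to be reaped before it
rebinds; that ordering is a choice made in this sketch.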

We need to understand what’s going on with the failure and come up with a
plan for Newton. Either we revert the suspected patches, as I believe
Armando proposed before, though it’s not clear how far back to revert; or
we come up with some smart fix that I don’t immediately see.

I will be on vacation tomorrow, though I will check the email thread to see
if we have a plan to act on. I really hope folks give the issue priority,
since it seems we have buried ourselves under a pile of interleaved patches
and no longer have a clear view of how to get out of it.

Cheers,
Ihar


