[openstack-dev] [neutron] Re: dhcp 'Address already in use' errors when trying to start a dnsmasq

Ihar Hrachyshka ihrachys at redhat.com
Tue Sep 27 18:27:53 UTC 2016


I wish I did not need to write such emails late in the evening. Added  
missing [neutron] tag.

Ihar Hrachyshka <ihrachys at redhat.com> wrote:

> Hi all,
>
> so we started getting ‘Address already in use’ when trying to start  
> dnsmasq after the previous instance of the process is killed with kill  
> -9. Armando spotted it today in logs for:  
> https://review.openstack.org/#/c/377626/ but per logstash it seems to be
> an error we have seen before (the earliest occurrence I see is 9/20), e.g.:
>
> http://logs.openstack.org/26/377626/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz
>
> Assuming I understand the flow of the failure, it runs as follows:
>
> - sync_state starts dnsmasq per network;
> - after agent lock is freed, some other notification event  
> (port_update/subnet_update/...) triggers restart for one of the processes;
> - the restart is done not via reload_allocations (-SIGHUP) but through
> restart/disable (kill -9);
> - once the old dnsmasq is killed with -9, we attempt to start a new  
> process with new config files generated and fail with: “dnsmasq: failed  
> to create listening socket for 10.1.15.242: Address already in use”
> - surprisingly, after several failed attempts, the process does start
> after a number of seconds and then runs fine.
>
> It looks like once we kill the process with -9, it may hold on to the
> socket resource for some time and clash with the new process we try
> to spawn. It’s a bit weird because dnsmasq should have set SO_REUSEADDR on
> the socket, so a new process should have started just fine.
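
For reference, a minimal Python sketch of the SO_REUSEADDR behaviour in question, covering only the TCP case on Linux (UDP sockets behave differently and are not covered). The helper names listen_tcp and try_bind are made up for illustration; nothing here is taken from dnsmasq or the agent code. It shows a second bind failing with EADDRINUSE while the first listener is still open, even though both sockets set SO_REUSEADDR, and succeeding once the first socket is closed.

```python
import errno
import socket


def listen_tcp(port=0):
    """Open a TCP listener on 127.0.0.1 with SO_REUSEADDR set (hypothetical helper)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", port))
    s.listen(1)
    return s


def try_bind(port):
    """Return 0 if a new listener can bind to the port, else the errno."""
    try:
        s = listen_tcp(port)
    except OSError as e:
        return e.errno
    s.close()
    return 0


old = listen_tcp()                # stands in for the "old" process
port = old.getsockname()[1]
while_alive = try_bind(port)      # old listener still holds the socket
old.close()
after_close = try_bind(port)      # old listener is gone

print("while old listener is open:", errno.errorcode.get(while_alive, "ok"))
print("after old listener closed: ", errno.errorcode.get(after_close, "ok"))
```

If the gate failure is this same case, it would suggest the old dnsmasq still holds its socket when the new one tries to bind, but that is only a guess at this point.
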
>
> Lately, we landed several patches that touched the reload logic for the
> DHCP agent on notifications. The suspicious ones in this context are:
>
> - https://review.openstack.org/#/c/372595/ - note it requests ‘disable’
> (-9) where it was using ‘reload_allocations’ (-SIGHUP) before, and it
> also does not unplug the port on lease release (maybe after we rip out the
> device, the address clash with the old dnsmasq state is gone even though
> the ‘new’ port will use the same address?).
> - https://review.openstack.org/#/c/372236/6 - we were requesting
> reload_allocations in some cases before, and now we put the network into
> the resync queue.
>
> There were other related changes lately; you can check the history of
> Kevin’s changes for the branch, which should capture most of them.
>
> I wonder whether we hit some long-standing restart issue with dnsmasq
> here that was just never triggered before because we were not calling
> kill -9 as eagerly as we do now.
>
> Note: Jakub Libosvar validated that ‘kill -9 && dnsmasq’ in a loop does NOT
> result in the failure we see in gate logs.
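
A rough stand-in for that kind of loop, as a sketch only: it uses a small Python TCP listener instead of dnsmasq, and the names CHILD and kill_and_rebind are hypothetical. Each round spawns a listener in a child process, kills it with SIGKILL, waits for the child to be reaped, and then tries to bind the same port again. Note the explicit wait; whether the DHCP agent does the equivalent before spawning the new dnsmasq is not established here.

```python
import signal
import socket
import subprocess
import sys

# Child program: listen on an ephemeral port, report it, then idle.
CHILD = (
    "import socket, time\n"
    "s = socket.socket()\n"
    "s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)\n"
    "s.bind(('127.0.0.1', 0))\n"
    "s.listen(1)\n"
    "print(s.getsockname()[1], flush=True)\n"
    "time.sleep(60)\n"
)


def kill_and_rebind():
    """Kill a listening child with SIGKILL, then try to rebind its port."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    port = int(child.stdout.readline())
    child.send_signal(signal.SIGKILL)   # same signal as kill -9
    child.wait()                        # wait until the child has exited
    child.stdout.close()

    s = socket.socket()
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        s.bind(("127.0.0.1", port))
        return True
    except OSError:
        return False
    finally:
        s.close()


results = [kill_and_rebind() for _ in range(3)]
print(results)
```
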
>
> We need to understand what’s going on with the failure, and come up with
> some plan for Newton. Either we revert the suspected patches, as I believe
> Armando proposed before, though then it’s not clear how far back to go;
> or we come up with some smart fix that I don’t immediately see.
>
> I will be on vacation tomorrow, though I will check the email thread to
> see if we have a plan to act on. I really hope folks give the issue
> priority, since it seems we have buried ourselves under a pile of
> interleaved patches and no longer have a clear view of how to get out
> of it.
>
> Cheers,
> Ihar




