[openstack-dev] dhcp 'Address already in use' errors when trying to start a dnsmasq

Armando M. armamig at gmail.com
Tue Sep 27 18:35:21 UTC 2016


On 27 September 2016 at 11:29, Miguel Angel Ajo Pelayo <majopela at redhat.com>
wrote:

> Ack, and thanks for the summary Ihar,
>
> I will have a look at it tomorrow morning, please update this thread
> with any progress.
>
>
>
> On Tue, Sep 27, 2016 at 8:22 PM, Ihar Hrachyshka <ihrachys at redhat.com>
> wrote:
> > Hi all,
> >
> > so we started getting ‘Address already in use’ when trying to start dnsmasq
> > after the previous instance of the process is killed with kill -9. Armando
> > spotted it today in logs for: https://review.openstack.org/#/c/377626/ but
> > as per logstash it seems like an error we saw before (the earliest I see is
> > 9/20), e.g.:
> >
> > http://logs.openstack.org/26/377626/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz
> >
> > Assuming I understand the flow of the failure, it runs as follows:
> >
> > - sync_state starts dnsmasq per network;
> > - after agent lock is freed, some other notification event
> > (port_update/subnet_update/...) triggers restart for one of the processes;
> > - the restart is done not via reload_allocations (-SIGHUP) but through
> > restart/disable (kill -9);
> > - once the old dnsmasq is killed with -9, we attempt to start a new process
> > with new config files generated and fail with: “dnsmasq: failed to create
> > listening socket for 10.1.15.242: Address already in use”
> > - surprisingly, after several failing attempts to start the process, it
> > starts successfully after several seconds and runs fine.
> >
> > It looks like once we kill the process with -9, it may hold on to the socket
> > resource for some time and may clash with the new process we try to spawn.
> > It’s a bit weird because dnsmasq should have set SO_REUSEADDR for the socket,
> > so a new process should have started just fine.
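
To make the socket behaviour under discussion concrete, here is a minimal sketch of when a bind reports 'Address already in use'. It is not dnsmasq and not the agent code: the helper name try_bind, the loopback address and the use of a TCP listening socket are all assumptions chosen for illustration, and the behaviour shown is what I would expect on Linux.

```python
import errno
import socket


def try_bind(port, reuse_addr=True):
    """Try to bind and listen on 127.0.0.1:port (hypothetical helper).

    Returns (socket, None) on success or (None, errno) on failure.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if reuse_addr:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen(1)
    except OSError as exc:
        sock.close()
        return None, exc.errno
    return sock, None


# The "old" listener; port 0 asks the kernel for any free port.
old, _ = try_bind(0)
port = old.getsockname()[1]

# A second listener on the same address fails while the old one is alive,
# even though both sockets set SO_REUSEADDR.
_, err_while_alive = try_bind(port)

# Once the old socket is closed, the same bind succeeds.
old.close()
new, err_after_close = try_bind(port)
if new is not None:
    new.close()
```

If this matches what happens in the gate, the error would suggest that something still holds the old socket at the moment the new process binds, though the sketch alone cannot confirm that.
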
> >
> > Lately, we landed several patches that touched reload logic for DHCP agent
> > on notifications. Among those suspicious in the context are:
> >
> > - https://review.openstack.org/#/c/372595/ - note it requests ‘disable’ (-9)
> > where it was using ‘reload_allocations’ (-SIGHUP) before, and it also does
> > not unplug the port on lease release (maybe after we rip out the device, the
> > address clash with the old dnsmasq state is gone even though the ‘new’ port
> > will use the same address?).
> > - https://review.openstack.org/#/c/372236/6 - we were requesting
> > reload_allocations in some cases before, and now we put the network into
> > resync queue
> >
> > There were other related changes lately, you can check history of Kevin’s
> > changes for the branch, it should capture most of them.
> >
> > I wonder whether we hit some long standing restart issue with dnsmasq here
> > that was just never triggered before because we were not calling kill -9 so
> > eagerly as we do now.
> >
> > Note: Jakub Libosvar validated that ‘kill -9 && dnsmasq’ in a loop does NOT
> > result in the failure we see in gate logs.
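
A kill-and-restart loop of that kind can be sketched roughly as below. This is not the command Jakub ran: the child is a hypothetical Python stand-in for dnsmasq that binds a UDP socket with SO_REUSEADDR on loopback, and the names CHILD, spawn, kill_and_restart and free_udp_port are made up for the example. Note one assumption in particular: the loop reaps the killed child with wait() before starting the next one.

```python
import socket
import subprocess
import sys

# Stand-in for dnsmasq (assumption): bind a UDP socket with SO_REUSEADDR,
# report the outcome on stdout, then idle until killed.
CHILD = """
import socket, sys, time
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
try:
    s.bind(("127.0.0.1", int(sys.argv[1])))
    print("bound", flush=True)
except OSError as exc:
    print("failed: %s" % exc, flush=True)
    sys.exit(1)
time.sleep(60)
"""


def spawn(port):
    """Start the stand-in child and return (process, first status line)."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD, str(port)],
                            stdout=subprocess.PIPE, text=True)
    return proc, proc.stdout.readline().strip()


def kill_and_restart(port, rounds=3):
    """Kill the child with SIGKILL and restart it, collecting bind results."""
    proc, status = spawn(port)
    results = [status]
    for _ in range(rounds):
        proc.kill()          # SIGKILL on POSIX, i.e. kill -9
        proc.wait()          # reap the child before restarting
        proc.stdout.close()
        proc, status = spawn(port)
        results.append(status)
    proc.kill()
    proc.wait()
    proc.stdout.close()
    return results


def free_udp_port():
    """Ask the kernel for a currently unused UDP port on loopback."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


results = kill_and_restart(free_udp_port())
```

In this sketch every restart is expected to bind cleanly, which is consistent with the note above but says nothing about why the agent sees a different result in the gate.
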
> >
> > We need to understand what’s going on with the failure, and come up with
> > some plan for Newton. We either revert suspected patches as I believe Armando
> > proposed before, but then it’s not clear until which point to do it; or we
> > come up with some smart fix for that, which I don’t immediately grasp.
> >
> > I will be on vacation tomorrow, though I will check the email thread to see
> > if we have a plan to act on. I really hope folks give the issue priority
> > since it seems like we buried ourselves under a pile of interleaved patches
> > and now we don’t have a clear view of how to get out of the pile.
>

Personally I feel there is no time left for us to do anything about this in
RC2. Nothing at this point guarantees that another patch will not lead to
new ripple effects. I am personally okay to cut RC2 as it stands, and let
downstream players have some time to vet the build and give us a chance to
fix one more last-minute "disaster".

Rest assured we'll learn from this mistake.

A.

>
> > Cheers,
> > Ihar