[openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)
Miguel Ángel Ajo
majopela at redhat.com
Wed Jan 28 09:05:06 UTC 2015
On Wednesday, 28 January 2015 at 09:50, Kevin Benton wrote:
> Hi,
>
> Approximately a year and a half ago, the default DHCP lease time in Neutron was increased from 120 seconds to 86400 seconds.[1] This was done with the goal of reducing DHCP traffic with very little discussion (based on what I can see in the review and bug report). While it does indeed reduce DHCP traffic, I don't think any bug reports were filed showing that a 120 second lease time resulted in too much traffic or that a jump all of the way to 86400 seconds was required instead of a value in the same order of magnitude.
>
> Why does this matter?
>
> Neutron ports can be updated with a new IP address from the same subnet or another subnet on the same network. The port update will result in anti-spoofing iptables rule changes that immediately stop the old IP address from working on the host. This means the host is unreachable for 0-12 hours based on the current default lease time without manual intervention[2] (assuming half-lease length DHCP renewal attempts).
>
> Why is this on the mailing list?
>
> In an attempt to make the VMs usable in a much shorter timeframe following a Neutron port address change, I submitted a patch to reduce the default DHCP lease time to 8 minutes.[3] However, this was upsetting to several people,[4] so it was suggested I bring this discussion to the mailing list. The following are the high-level concerns followed by my responses:
> 8 minutes is arbitrary
> Yes, but it's no more arbitrary than 1440 minutes. I picked it as an interval because it is still 4 times larger than the last short value, but it still allows VMs to regain connectivity in <5 minutes in the event their IP is changed. If someone has a good suggestion for another interval based on known dnsmasq QPS limits or some other quantitative reason, please chime in here.
>
> other datacenters use long lease times
> This is true, but it's not really a valid comparison. In most regular datacenters, updating a static DHCP lease has no effect on the data plane so it doesn't matter that the client doesn't react for hours/days (even with DHCP snooping enabled). However, in Neutron's case, the security groups are immediately updated so all traffic using the old address is blocked.
>
> dhcp traffic is scary because it's broadcast
> ARP traffic is also broadcast and many clients will expire entries every 5-10 minutes and re-ARP. L2population may be used to prevent ARP propagation, so the comparison between DHCP and ARP isn't always relevant here.
>
>
>
>
From what I’ve seen, at least on Linux, the first DHCP request is broadcast. All subsequent lease renewals are unicast unless the original
DHCP server can’t be contacted, in which case the DHCP client falls back to broadcast to find another server that can renew its lease.
So only the initial boot of an instance should generate broadcast traffic.
Your proposal seems reasonable to me.
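As a rough sketch of the arithmetic behind the 0-12 hour and <5 minute figures quoted above: it assumes, as the quoted message does, that clients first try to renew at half the lease time, and that the port's IP is changed right after a renewal. The function name is hypothetical and only for illustration.

```python
def worst_case_outage_seconds(lease_seconds, renew_fraction=0.5):
    """Longest wait before a client next contacts the DHCP server.

    Assumption: the client renews at renew_fraction of the lease
    (half-lease renewal) and the address change lands just after
    the previous renewal completed.
    """
    return lease_seconds * renew_fraction


# Lease values taken from the thread: old default, current default, proposal.
for label, lease in (("old default", 120),
                     ("current default", 86400),
                     ("proposed", 8 * 60)):
    outage = worst_case_outage_seconds(lease)
    print(f"{label}: lease {lease}s, up to {outage / 60:g} min unreachable")
```

Under these assumptions the 86400 second default gives a worst case of 12 hours, and the proposed 8 minute lease gives 4 minutes.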
In this context, please see this ongoing work [5], especially the comments here [6], where we’re discussing optimization
because of the theoretical 120 second limit for renewals at scale. We made some calculations of CPU usage for the current default; I
will recalculate those for the newly proposed default of 8 minutes.
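To give a feel for how the renewal load scales with lease time, here is a minimal sketch. It is not the calculation from the review: it only models the average steady-state renewal rate, assuming half-lease renewal and renewals spread evenly over time. The deployment size is a made-up number.

```python
def renewals_per_second(instances, lease_seconds, renew_fraction=0.5):
    """Average steady-state renewal rate seen by the DHCP server(s).

    Assumption: every instance renews once per renew interval
    (lease_seconds * renew_fraction) and renewals are evenly spread.
    """
    renew_interval = lease_seconds * renew_fraction
    return instances / renew_interval


INSTANCES = 10_000  # hypothetical deployment size, not a figure from the review

for lease in (120, 8 * 60, 86400):
    rate = renewals_per_second(INSTANCES, lease)
    print(f"lease {lease:>5}s: ~{rate:.2f} renewals/s")
```

This ignores bursts such as the request storm after a dnsmasq restart, which is the case the patch is actually about.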
TL;DR:
That patch fixes an issue found when you restart dnsmasq: old leases can’t be renewed, so we end up in a storm of requests.
To fix that, we need to provide dnsmasq with a script that initializes the leases table. Initially that script was written in Python,
but the script is called for init (once), lease (once per instance), and renew (once per instance at every lease renewal interval),
so we should minimize the cost of the script as much as possible, or contribute a change to dnsmasq so that the script is not called
for lease renewals when some flag is set.
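The call pattern described above (init once, lease once per instance, renew once per instance per renewal interval) can be sketched as a simple count. This is only an illustrative model with a hypothetical function name; it does not invoke or emulate dnsmasq, and it assumes half-lease renewal.

```python
def script_invocations(instances, lease_seconds, window_seconds,
                       renew_fraction=0.5):
    """Count lease-script calls over a time window under a simple model.

    init:  once, at dnsmasq startup
    lease: once per instance
    renew: once per instance for each renew interval that fully
           elapses within the window (assumed half-lease renewal)
    """
    renew_interval = lease_seconds * renew_fraction
    renewals_per_instance = int(window_seconds // renew_interval)
    return {
        "init": 1,
        "lease": instances,
        "renew": instances * renewals_per_instance,
    }


# Hypothetical example: 1000 instances, 8 minute lease, one hour window.
counts = script_invocations(1000, 8 * 60, 3600)
print(counts)
```

In this model the renew count dominates as soon as the window covers more than a couple of renewal intervals, which is why the cost of each script call matters.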
>
> Please reply back with your opinions/anecdotes/data related to short DHCP lease times.
>
> Cheers
>
> 1. https://github.com/openstack/neutron/commit/d9832282cf656b162c51afdefb830dacab72defe
> 2. Manual intervention could be an instance reboot, a dhcp client invocation via the console, or a delayed invocation right before the update. (all significantly more difficult to script than a simple update of a port's IP via the API).
> 3. https://review.openstack.org/#/c/150595/
> 4. http://i.imgur.com/xtvatkP.jpg
>
>
>
>
5. https://review.openstack.org/#/c/108272/
6. https://review.openstack.org/#/c/108272/8/neutron/agent/linux/dhcp.py
>
> --
> Kevin Benton
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>