Open Stack

Wed Jan 28 19:19:01 UTC 2015

Hi Kevin,

On 01/28/2015 03:50 AM, Kevin Benton wrote:
> Hi,
> 
> Approximately a year and a half ago, the default DHCP lease time in Neutron was
> increased from 120 seconds to 86400 seconds.[1] This was done with the goal of
> reducing DHCP traffic with very little discussion (based on what I can see in
> the review and bug report). While it does indeed reduce DHCP traffic, I don't
> think any bug reports were filed showing that a 120 second lease time resulted
> in too much traffic or that a jump all of the way to 86400 seconds was required
> instead of a value in the same order of magnitude.
> 
> Why does this matter? 
> 
> Neutron ports can be updated with a new IP address from the same subnet or
> another subnet on the same network. The port update will result in anti-spoofing
> iptables rule changes that immediately stop the old IP address from working on
> the host. This means the host is unreachable for 0-12 hours based on the current
> default lease time without manual intervention[2] (assuming half-lease length
> DHCP renewal attempts).
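
To make the "0-12 hours" figure concrete, here is a rough sketch of the arithmetic. It assumes, as the quoted text does, that clients first try to renew at half the lease length; the function name and the renew_fraction parameter are made up for illustration and are not part of Neutron.

```python
def outage_window_seconds(lease_seconds, renew_fraction=0.5):
    """Return (best, worst) seconds until a client next contacts DHCP.

    Assumption: the client renews at renew_fraction of the lease, so a
    client that renewed just before the port update waits the longest.
    """
    return (0.0, lease_seconds * renew_fraction)


# Lease times mentioned in this thread: old default, proposed, current default.
for lease in (120, 480, 86400):
    best, worst = outage_window_seconds(lease)
    print(f"lease={lease}s: unreachable for {best:.0f}-{worst:.0f}s "
          f"(up to {worst / 3600:.2f}h)")
```

With the proposed 8 minute lease the worst case comes out at 4 minutes, which is where the "<5 minutes" claim further down comes from.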

So I'll first comment on the problem.  You're essentially "pulling the rug" out
from under these VMs by changing their IP (and that of their router and DHCP/DNS
server), but you expect they should fail quickly and come right back online.  In
a non-Neutron environment wouldn't the IT person that did this need some pretty
good heat-resistant pants for all the flames from pissed-off users?  Sure, the
guy on his laptop will just bounce the connection, but servers (aka VMs) should
stay pretty static.  VMs are servers (and cows according to some).

The correct solution is to be able to renumber the network so there is no issue
with the anti-spoofing rules dropping packets, or the VMs having an unreachable
IP address, but that's a much bigger nut to crack.

> Why is this on the mailing list?
> 
> In an attempt to make the VMs usable in a much shorter timeframe following a
> Neutron port address change, I submitted a patch to reduce the default DHCP
> lease time to 8 minutes.[3] However, this was upsetting to several people,[4] so
> it was suggested I bring this discussion to the mailing list. The following are
> the high-level concerns followed by my responses:
> 
>   * 8 minutes is arbitrary
>       o Yes, but it's no more arbitrary than 1440 minutes. I picked it as an
>         interval because it is still 4 times larger than the last short value,
>         but it still allows VMs to regain connectivity in <5 minutes in the
>         event their IP is changed. If someone has a good suggestion for another
>         interval based on known dnsmasq QPS limits or some other quantitative
>         reason, please chime in here.

We run 48 hours as the default in our public cloud, and I did some digging to
remind myself of the multiple reasons:

1. Too much DHCP traffic.  Sure, only that initial request is broadcast, but
dnsmasq is very verbose and loves writing to syslog for everything it does -
less is more.  Do a scale test with 10K VMs and you'll quickly find out a large
portion of traffic is DHCP RENEWs, and syslog is huge.

2. During a control-plane upgrade or outage, having a short DHCP lease time will
take all your VMs offline.  The old value of 2 minutes is not a realistic value
for an upgrade, and I don't think 8 minutes is much better.  Yes, when DHCP is
down you can't boot a new VM, but as long as customers can get to their existing
VMs they're pretty happy and won't scream bloody murder.

There's probably more, but those were the top two, with #2 being most important.
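
Both points can be put into rough numbers. The sketch below is only a back-of-the-envelope model: it assumes renewals are spread evenly over time, that clients renew at half the lease, and that a client keeps its address until the lease expires when renewal fails. The function names are hypothetical, and the output is not a measurement from any deployment.

```python
def renewals_per_second(num_vms, lease_seconds, renew_fraction=0.5):
    """Average DHCP renewals per second, assuming evenly spread renewals."""
    return num_vms / (lease_seconds * renew_fraction)


def guaranteed_survival_seconds(lease_seconds, renew_fraction=0.5):
    """Lease time left for the unluckiest client when DHCP goes down.

    Assumption: that client was just about to renew, so only the part of
    the lease after the renewal point remains.
    """
    return lease_seconds * (1 - renew_fraction)


# 2 minutes, 8 minutes, 24 hours, 48 hours
for lease in (120, 480, 86400, 172800):
    rate = renewals_per_second(10_000, lease)
    survive = guaranteed_survival_seconds(lease)
    print(f"lease={lease}s: ~{rate:.2f} renewals/s for 10K VMs, "
          f"every VM keeps its address for at least {survive:.0f}s of DHCP outage")
```

Under these assumptions, an 8 minute lease only guarantees 4 minutes of DHCP downtime before the first VMs lose their address, versus 24 hours with a 48 hour lease.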

>   * other datacenters use long lease times
>       o This is true, but it's not really a valid comparison. In most regular
>         datacenters, updating a static DHCP lease has no effect on the data
>         plane so it doesn't matter that the client doesn't react for hours/days
>         (even with DHCP snooping enabled). However, in Neutron's case, the
>         security groups are immediately updated so all traffic using the old
>         address is blocked.

Yes, and choosing the lease time is a deployment decision that needs to take a
lot of things into account.  Like I said, we don't even use the default.  The
default should just be a good guess for a standard deployment, not a value that
caters to edge cases, especially when the value is tunable in neutron.conf.
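
As an illustration of what "tunable" means here, the sketch below reads a lease time from a neutron.conf-style snippet. It assumes the option is named dhcp_lease_duration, lives in the [DEFAULT] section, and is given in seconds; check the documentation for your release before relying on that. The sample file contents and the helper name are invented for the example.

```python
import configparser

# Hypothetical neutron.conf fragment; 172800 s = 48 hours.
SAMPLE_CONF = """
[DEFAULT]
# Assumed option name and unit (seconds); verify against your release.
dhcp_lease_duration = 172800
"""


def read_lease_seconds(conf_text, default=86400):
    """Return the configured lease time, or the default if it is unset."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getint("DEFAULT", "dhcp_lease_duration", fallback=default)


print(read_lease_seconds(SAMPLE_CONF))
```

A deployment that expects frequent port address changes could set a much lower value here without anyone touching the shipped default.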

>   * dhcp traffic is scary because it's broadcast
>       o ARP traffic is also broadcast and many clients will expire entries every
>         5-10 minutes and re-ARP. L2population may be used to prevent ARP
>         propagation, so the comparison between DHCP and ARP isn't always
>         relevant here.

I don't recall anyone being scared of broadcast, and can't find any comments
regarding it in https://review.openstack.org/#/c/150595/

> Please reply back with your opinions/anecdotes/data related to short DHCP lease
> times.

I can only speculate on why 24 hours was chosen as the default back in 2013,
possibly because a lot of wireless router firmware defaults are set as such?

> 1. https://github.com/openstack/neutron/commit/d9832282cf656b162c51afdefb830dacab72defe
> 2. Manual intervention could be an instance reboot, a dhcp client invocation via
> the console, or a delayed invocation right before the update. (all significantly
> more difficult to script than a simple update of a port's IP via the API).
> 3. https://review.openstack.org/#/c/150595/
> 4. http://i.imgur.com/xtvatkP.jpg

I was a much bigger baby than that :)

-Brian

[openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)