[openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

Chuck Carlino chuckjcarlino at gmail.com
Wed Jan 28 23:59:20 UTC 2015


On 01/28/2015 12:51 PM, Kevin Benton wrote:
>
> If we are going to ignore the IP address changing use-case, can we 
> just make the default infinity? Then nobody ever has to worry about 
> control plane outages for existing clients. 24 hours is way too long to 
> be useful anyway.
>

Why would users want to change an active port's IP address anyway? I can 
see possible use in changing an inactive port's IP address, but that 
wouldn't cause the dhcp issues mentioned here.  I worry about setting a 
default config value to handle a very unusual use case.

Chuck


> On Jan 28, 2015 12:44 PM, "Salvatore Orlando" <sorlando at nicira.com> wrote:
>
>
>
>     On 28 January 2015 at 20:19, Brian Haley <brian.haley at hp.com> wrote:
>
>         Hi Kevin,
>
>         On 01/28/2015 03:50 AM, Kevin Benton wrote:
>         > Hi,
>         >
>         > Approximately a year and a half ago, the default DHCP lease time in
>         > Neutron was increased from 120 seconds to 86400 seconds.[1] This was
>         > done with the goal of reducing DHCP traffic with very little
>         > discussion (based on what I can see in the review and bug report).
>         > While it does indeed reduce DHCP traffic, I don't think any bug
>         > reports were filed showing that a 120 second lease time resulted in
>         > too much traffic or that a jump all of the way to 86400 seconds was
>         > required instead of a value in the same order of magnitude.
>         >
>         > Why does this matter?
>         >
>         > Neutron ports can be updated with a new IP address from the same
>         > subnet or another subnet on the same network. The port update will
>         > result in anti-spoofing iptables rule changes that immediately stop
>         > the old IP address from working on the host. This means the host is
>         > unreachable for 0-12 hours based on the current default lease time
>         > without manual intervention[2] (assuming half-lease length DHCP
>         > renewal attempts).
>
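
A rough sketch of the arithmetic behind the "0-12 hours" figure quoted above,
in Python. It assumes the client first tries to renew at half the lease time
(the T1 default in RFC 2131) and picks up the new address on that first
attempt. The function name is made up for illustration.

```python
def max_unreachable_seconds(lease_seconds, renewal_fraction=0.5):
    """Upper bound on how long a VM keeps using its old, now-blocked address.

    Assumption: the client last renewed just before the port update, so it
    waits a full renewal interval (renewal_fraction * lease) before it
    contacts the DHCP server again.
    """
    return lease_seconds * renewal_fraction


if __name__ == "__main__":
    # 120 s was the old default, 480 s the proposed one, 86400 s the current one.
    for lease in (120, 480, 86400):
        worst = max_unreachable_seconds(lease)
        print("lease %6d s: unreachable for up to %6.0f s (%.2f h)"
              % (lease, worst, worst / 3600.0))
```

Under these assumptions the 86400 second default gives a 12 hour worst case,
while 480 seconds gives 4 minutes.
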
>         So I'll first comment on the problem.  You're essentially "pulling
>         the rug" out from under these VMs by changing their IP (and that of
>         their router and DHCP/DNS server), but you expect they should fail
>         quickly and come right back online.  In a non-Neutron environment
>         wouldn't the IT person that did this need some pretty good
>         heat-resistant pants for all the flames from pissed-off users?
>         Sure, the guy on his laptop will just bounce the connection, but
>         servers (aka VMs) should stay pretty static.  VMs are servers (and
>         cows according to some).
>
>
>     I actually expect this kind of operation to be one Neutron users
>     will not do very often, mostly because regardless of whether you're in
>     the cloud or not, you'd still need to wear those heat-resistant pants.
>
>
>         The correct solution is to be able to renumber the network so there
>         is no issue with the anti-spoofing rules dropping packets, or the
>         VMs having an unreachable IP address, but that's a much bigger nut
>         to crack.
>
>
>     Indeed. In my opinion the "update IP" operation sets false
>     expectations in users. I have considered disallowing PUT on
>     fixed_ips in the past but that did not go ahead because there were
>     users leveraging it.
>
>
>         > Why is this on the mailing list?
>         >
>         > In an attempt to make the VMs usable in a much shorter timeframe
>         > following a Neutron port address change, I submitted a patch to
>         > reduce the default DHCP lease time to 8 minutes.[3] However, this
>         > was upsetting to several people,[4] so it was suggested I bring
>         > this discussion to the mailing list. The following are the
>         > high-level concerns followed by my responses:
>         >
>         >   * 8 minutes is arbitrary
>         >       o Yes, but it's no more arbitrary than 1440 minutes. I picked
>         >         it as an interval because it is still 4 times larger than
>         >         the last short value, but it still allows VMs to regain
>         >         connectivity in <5 minutes in the event their IP is
>         >         changed. If someone has a good suggestion for another
>         >         interval based on known dnsmasq QPS limits or some other
>         >         quantitative reason, please chime in here.
>
>         We run 48 hours as the default in our public cloud, and I did some
>         digging to remind myself of the multiple reasons:
>
>         1. Too much DHCP traffic.  Sure, only that initial request is
>         broadcast, but dnsmasq is very verbose and loves writing to syslog
>         for everything it does - less is more.  Do a scale test with 10K
>         VMs and you'll quickly find out a large portion of traffic is DHCP
>         RENEWs, and syslog is huge.
>
>
>     This is correct, and something I overlooked in my previous post.
>     Nevertheless I still think that it is really impossible to find an
>     optimal default which is regarded as such by every user. The
>     current default has been chosen mostly for the reason you explain
>     below, and I don't see a strong reason for changing it.
>
>
>         2. During a control-plane upgrade or outage, having a short DHCP
>         lease time will take all your VMs offline.  The old value of 2
>         minutes is not a realistic value for an upgrade, and I don't think
>         8 minutes is much better.  Yes, when DHCP is down you can't boot a
>         new VM, but as long as customers can get to their existing VMs
>         they're pretty happy and won't scream bloody murder.
>
>
>     In our cloud we were continuously hit by this. We could not take
>     our dhcp agents out, otherwise all VMs would lose their leases,
>     unless the downtime of the agent was very brief.
>
>
>         There's probably more, but those were the top two, with #2 being
>         most important.
>
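
Both reasons above can be put into rough numbers. The Python sketch below is
only an estimate under stated assumptions: every client renews at half its
lease (the RFC 2131 default for T1), renewals are spread evenly over time, and
a VM only loses its address when its lease actually expires. The function
names are hypothetical; the 10K VM count comes from the scale test mentioned
above.

```python
def renewals_per_second(vm_count, lease_seconds, renewal_fraction=0.5):
    """Average steady-state DHCP renewal rate seen by the DHCP server."""
    return vm_count / (lease_seconds * renewal_fraction)


def safe_outage_seconds(lease_seconds, renewal_fraction=0.5):
    """Longest DHCP outage during which no lease can expire.

    Worst case: a client was just about to renew when the outage began,
    so only (1 - renewal_fraction) of its lease is left.
    """
    return lease_seconds * (1.0 - renewal_fraction)


if __name__ == "__main__":
    vm_count = 10000
    # 2 minutes, 8 minutes, 24 hours and 48 hours, the values from this thread.
    for lease in (120, 480, 86400, 172800):
        print("lease %6d s: %8.2f renewals/s, safe outage %6.0f s"
              % (lease,
                 renewals_per_second(vm_count, lease),
                 safe_outage_seconds(lease)))
```

The two numbers move in opposite directions: shortening the lease raises the
renewal load and shrinks the outage the VMs can ride out.
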
>
>     Summarizing, I think that Kevin is exposing a real, albeit
>     well-known, problem (sorry about my dhcp release faux pas - I can
>     use jet lag as a justification!), and he's proposing a mitigation
>     to it. On the other hand, this mitigation, as Brian explains, is
>     going to cause real operational issues. Still, we're arguing about
>     a default value for a configuration parameter. I therefore
>     think the best thing that we can do is to state explicitly what
>     happens when setting long or short lease times.
>     I expected this to be documented in [1], but it's not. I think
>     that place and neutron.conf might contain this kind of
>     documentation, such as:
>
>     # DHCP Lease duration (in seconds).
>     # Use -1 to tell dnsmasq to use infinite lease times.
>     # dhcp_lease_duration = 86400
>     # Note that long DHCP leases will result in delays
>     # in instances acquiring updated IP addresses. This
>     # may result in downtime for those instances, as the
>     # anti-spoofing policy will then block all traffic in and
>     # out of them. In order to minimise this downtime window
>     # the lease time should be shorter, for example
>     # dhcp_lease_duration = 480
>
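
A minimal sketch of how the option described in the comment above might be
read and interpreted, using only the Python standard library. It assumes
dhcp_lease_duration sits in the [DEFAULT] section of neutron.conf (an
assumption, not stated in this thread) and that -1 means infinite, as in the
proposed comment. The helper name is hypothetical and this is not Neutron's
own code.

```python
import configparser

# Sample file contents; the [DEFAULT] section name is an assumption.
SAMPLE_CONF = """
[DEFAULT]
dhcp_lease_duration = 480
"""


def describe_lease(conf_text, default=86400):
    """Return a one-line description of the configured DHCP lease time."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # Fall back to the current 86400 second default when the option is unset.
    seconds = parser.getint("DEFAULT", "dhcp_lease_duration", fallback=default)
    if seconds == -1:
        return "infinite lease: clients never need to renew"
    # Assumes clients renew at half the lease time.
    return "lease %d s: clients renew roughly every %d s" % (seconds, seconds // 2)


if __name__ == "__main__":
    print(describe_lease(SAMPLE_CONF))
```
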
>     However, I would not change the current system default, as this
>     might affect operational systems.
>
>     Apologies again for my stupid dhcp-release note,
>     Salvatore
>
>     [1] http://developer.openstack.org/api-ref-networking-v2.html
>
>
>         >   * other datacenters use long lease times
>         >       o This is true, but it's not really a valid comparison. In
>         >         most regular datacenters, updating a static DHCP lease has
>         >         no effect on the data plane so it doesn't matter that the
>         >         client doesn't react for hours/days (even with DHCP
>         >         snooping enabled). However, in Neutron's case, the security
>         >         groups are immediately updated so all traffic using the old
>         >         address is blocked.
>
>         Yes, and choosing the lease time is a deployment decision that needs
>         to take a lot of things into account.  Like I said, we don't even
>         use the default.  The default should just be a good guess for a
>         standard deployment, not a value that caters towards the edge cases,
>         especially when the value is tunable in neutron.conf.
>
>         >   * dhcp traffic is scary because it's broadcast
>         >       o ARP traffic is also broadcast and many clients will expire
>         >         entries every 5-10 minutes and re-ARP. L2population may be
>         >         used to prevent ARP propagation, so the comparison between
>         >         DHCP and ARP isn't always relevant here.
>
>         I don't recall anyone being scared of broadcast, and can't find any
>         comments regarding it in https://review.openstack.org/#/c/150595/
>
>         > Please reply back with your opinions/anecdotes/data related to
>         > short DHCP lease times.
>
>         I can only speculate on why 24 hours was chosen as the default back
>         in 2013, possibly because a lot of wireless router firmware defaults
>         are set as such?
>
>         > 1. https://github.com/openstack/neutron/commit/d9832282cf656b162c51afdefb830dacab72defe
>         > 2. Manual intervention could be an instance reboot, a dhcp client
>         >    invocation via the console, or a delayed invocation right before
>         >    the update. (all significantly more difficult to script than a
>         >    simple update of a port's IP via the API).
>         > 3. https://review.openstack.org/#/c/150595/
>         > 4. http://i.imgur.com/xtvatkP.jpg
>
>         I was a much bigger baby than that :)
>
>         -Brian
>
>         __________________________________________________________________________
>         OpenStack Development Mailing List (not for usage questions)
>         Unsubscribe:
>         OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>         <http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe>
>         http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
