[openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)
Chuck Carlino
chuckjcarlino at gmail.com
Wed Jan 28 23:59:20 UTC 2015
On 01/28/2015 12:51 PM, Kevin Benton wrote:
>
> If we are going to ignore the IP address changing use-case, can we
> just make the default infinity? Then nobody ever has to worry about
> control plane outages for existing clients. 24 hours is way too long to
> be useful anyway.
>
Why would users want to change an active port's IP address anyway? I can
see possible use in changing an inactive port's IP address, but that
wouldn't cause the dhcp issues mentioned here. I worry about setting a
default config value to handle a very unusual use case.
Chuck
> On Jan 28, 2015 12:44 PM, "Salvatore Orlando" <sorlando at nicira.com> wrote:
>
>
>
> On 28 January 2015 at 20:19, Brian Haley <brian.haley at hp.com> wrote:
>
> Hi Kevin,
>
> On 01/28/2015 03:50 AM, Kevin Benton wrote:
> > Hi,
> >
> > Approximately a year and a half ago, the default DHCP lease
> time in Neutron was
> > increased from 120 seconds to 86400 seconds.[1] This was
> done with the goal of
> > reducing DHCP traffic with very little discussion (based on
> what I can see in
> > the review and bug report). While it does indeed reduce
> DHCP traffic, I don't
> > think any bug reports were filed showing that a 120 second
> lease time resulted
> > in too much traffic or that a jump all of the way to 86400
> seconds was required
> > instead of a value in the same order of magnitude.
> >
> > Why does this matter?
> >
> > Neutron ports can be updated with a new IP address from the
> same subnet or
> > another subnet on the same network. The port update will
> result in anti-spoofing
> > iptables rule changes that immediately stop the old IP
> address from working on
> > the host. This means the host is unreachable for 0-12 hours
> based on the current
> > default lease time without manual intervention[2] (assuming
> half-lease length
> > DHCP renewal attempts).
>
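The 0-12 hour figure above follows from simple arithmetic. The following is a minimal sketch of it, assuming (as the message does) that a client first tries to renew at half the lease length and only then picks up its new address. The function name `unreachable_window` and the `renew_fraction` parameter are hypothetical and are not part of Neutron or dnsmasq.

```python
def unreachable_window(lease_seconds, renew_fraction=0.5):
    """Return (best_case, worst_case) seconds a VM may stay unreachable
    after its port IP changes, assuming it recovers at its next renewal.

    Assumption: the client renews once renew_fraction of the lease has
    elapsed, so the wait depends on how recently it last renewed.
    """
    if lease_seconds <= 0:
        raise ValueError("lease_seconds must be positive")
    if not 0 < renew_fraction <= 1:
        raise ValueError("renew_fraction must be in (0, 1]")
    # Best case: the renewal was due right as the port was updated.
    # Worst case: the client renewed just before the port was updated.
    return 0.0, lease_seconds * renew_fraction


if __name__ == "__main__":
    # Old default, proposed default, current default (all in seconds).
    for lease in (120, 480, 86400):
        _, worst = unreachable_window(lease)
        print(f"lease={lease:>6}s  worst case={worst / 60:.0f} min")
```

Under these assumptions the 86400 second default gives a worst case of 12 hours, and the proposed 480 second lease gives 4 minutes.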
> So I'll first comment on the problem. You're essentially
> "pulling the rug" out
> from under these VMs by changing their IP (and that of their
> router and DHCP/DNS
> server), but you expect they should fail quickly and come
> right back online. In
> a non-Neutron environment wouldn't the IT person that did this
> need some pretty
> good heat-resistant pants for all the flames from pissed-off
> users? Sure, the
> guy on his laptop will just bounce the connection, but servers
> (aka VMs) should
> stay pretty static. VMs are servers (and cows according to some).
>
>
> I actually expect this kind of operation to not be one Neutron users
> will do very often, mostly because regardless of whether you're in
> the cloud or not, you'd still need to wear those heat resistant pants.
>
>
> The correct solution is to be able to renumber the network so
> there is no issue
> with the anti-spoofing rules dropping packets, or the VMs
> having an unreachable
> IP address, but that's a much bigger nut to crack.
>
>
> Indeed. In my opinion the "update IP" operation sets false
> expectations in users. I have considered disallowing PUT on
> fixed_ips in the past but that did not go ahead because there were
> users leveraging it.
>
>
> > Why is this on the mailing list?
> >
> > In an attempt to make the VMs usable in a much shorter
> timeframe following a
> > Neutron port address change, I submitted a patch to reduce
> the default DHCP
> > lease time to 8 minutes.[3] However, this was upsetting to
> several people,[4] so
> > it was suggested I bring this discussion to the mailing
> list. The following are
> > the high-level concerns followed by my responses:
> >
> > * 8 minutes is arbitrary
> > o Yes, but it's no more arbitrary than 1440 minutes. I
> picked it as an
> > interval because it is still 4 times larger than the last short value,
> > but it still allows VMs to regain connectivity in <5
> minutes in the
> > event their IP is changed. If someone has a good
> suggestion for another
> > interval based on known dnsmasq QPS limits or some
> other quantitative
> > reason, please chime in here.
>
> We run 48 hours as the default in our public cloud, and I did
> some digging to
> remind myself of the multiple reasons:
>
> 1. Too much DHCP traffic. Sure, only that initial request is
> broadcast, but
> dnsmasq is very verbose and loves writing to syslog for
> everything it does -
> less is more. Do a scale test with 10K VMs and you'll quickly
> find out a large
> portion of traffic is DHCP RENEWs, and syslog is huge.
>
>
> This is correct, and something I overlooked in my previous post.
> Nevertheless I still think that it is really impossible to find an
> optimal default which is regarded as such by every user. The
> current default has been chosen mostly for the reason you explain
> below, and I don't see a strong reason for changing it.
>
>
> 2. During a control-plane upgrade or outage, having a short
> DHCP lease time will
> take all your VMs offline. The old value of 2 minutes is not
> a realistic value
> for an upgrade, and I don't think 8 minutes is much better.
> Yes, when DHCP is
> down you can't boot a new VM, but as long as customers can get
> to their existing
> VMs they're pretty happy and won't scream bloody murder.
>
>
> In our cloud we were continuously hit by this. We could not take
> our dhcp agents out, otherwise all VMs would lose their leases,
> unless the downtime of the agent was very brief.
>
>
> There's probably more, but those were the top two, with #2
> being most important.
>
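The two reasons above can be put into rough numbers. This is a back-of-the-envelope sketch only, assuming every client renews at half the lease length and that renewals are spread evenly over time. Both function names are hypothetical, and the rate is an aggregate across all DHCP agents, not a measured dnsmasq limit.

```python
def renewals_per_second(num_clients, lease_seconds, renew_fraction=0.5):
    """Average aggregate DHCP renewal rate for num_clients.

    Assumption: each client renews once every lease * renew_fraction
    seconds and renewals are uniformly spread, so there are no bursts.
    """
    return num_clients / (lease_seconds * renew_fraction)


def guaranteed_outage_tolerance(lease_seconds, renew_fraction=0.5):
    """Longest DHCP outage (seconds) that no client lease can expire in.

    Worst case: a client was about to renew when the agent went down,
    so only the remaining (1 - renew_fraction) of its lease is left.
    """
    return lease_seconds * (1 - renew_fraction)


if __name__ == "__main__":
    vms = 10_000  # scale-test size mentioned above
    # 2 min, 8 min, 24 h and 48 h lease times, in seconds.
    for lease in (120, 480, 86400, 172800):
        rate = renewals_per_second(vms, lease)
        tolerance = guaranteed_outage_tolerance(lease)
        print(f"lease={lease:>6}s  renewals/s={rate:8.2f}  "
              f"safe outage={tolerance / 60:7.1f} min")
```

Under these assumptions, a shorter lease raises the renewal rate and shrinks the outage a deployment can ride out by the same factor.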
>
> Summarizing, I think that Kevin is exposing a real, albeit
> well-known problem (sorry about my dhcp release faux pas - I can
> use jet lag as a justification!), and he's proposing a mitigation
> to it. On the other hand, this mitigation, as Brian explains, is
> going to cause real operational issues. Still, we're arguing over
> a default value for a configuration parameter. I therefore
> think the best thing that we can do is explicitly stating what
> happens when setting long or short lease times.
> I expected this to be documented in [1], but it's not. I think
> that place and neutron.conf might contain this kind of
> documentation, such as:
>
> # DHCP Lease duration (in seconds).
> # Use -1 to tell dnsmasq to use infinite lease times.
> # dhcp_lease_duration = 86400
> # Note that long DHCP leases will result in delays
> # in instances acquiring updated IP addresses. This
> # may result in downtime for those instances as anti-
> # spoofing policy will then block all traffic in and out of
> # them. In order to minimise this downtime window
> # the lease time should be shorter, for example
> # dhcp_lease_duration = 480
>
> However, I would not change the current system default, as this
> might affect operational systems.
>
> Apologies again for my stupid dhcp-release note,
> Salvatore
>
> [1] http://developer.openstack.org/api-ref-networking-v2.html
>
>
> > * other datacenters use long lease times
> > o This is true, but it's not really a valid
> comparison. In most regular
> > datacenters, updating a static DHCP lease has no effect on the data
> > plane so it doesn't matter that the client doesn't
> react for hours/days
> > (even with DHCP snooping enabled). However, in
> Neutron's case, the
> > security groups are immediately updated so all
> traffic using the old
> > address is blocked.
>
> Yes, and choosing the lease time is a deployment decision that
> needs to take a
> lot of things into account. Like I said, we don't even use
> the default. The
> default should just be a good guess for a standard deployment,
> not a value that
> caters towards the edge cases, especially when the value is
> tunable in neutron.conf.
>
> > * dhcp traffic is scary because it's broadcast
> > o ARP traffic is also broadcast and many clients will
> expire entries every
> > 5-10 minutes and re-ARP. L2population may be used to prevent ARP
> > propagation, so the comparison between DHCP and ARP
> isn't always
> > relevant here.
>
> I don't recall anyone being scared of broadcast, and can't
> find any comments
> regarding it in https://review.openstack.org/#/c/150595/
>
> > Please reply back with your opinions/anecdotes/data related
> to short DHCP lease
> > times.
>
> I can only speculate on why 24 hours was chosen as the default
> back in 2013,
> possibly because a lot of wireless router firmware defaults
> are set as such?
>
> > 1.
> https://github.com/openstack/neutron/commit/d9832282cf656b162c51afdefb830dacab72defe
> > 2. Manual intervention could be an instance reboot, a dhcp
> client invocation via
> > the console, or a delayed invocation right before the
> update. (all significantly
> > more difficult to script than a simple update of a port's IP
> via the API).
> > 3. https://review.openstack.org/#/c/150595/
> > 4. http://i.imgur.com/xtvatkP.jpg
>
> I was a much bigger baby than that :)
>
> -Brian
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe:
> OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> <http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe>
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
>