<div dir="ltr"><div>><span style="font-size:12.8000001907349px">Why would users want to change an active port's IP address anyway?</span></div><div><span style="font-size:12.8000001907349px"><br></span></div><div><span style="font-size:12.8000001907349px">Re-addressing. It's not common, but the entire reason I brought this up is that a user was moving an instance to another subnet on the same network and stranded one of their VMs.</span></div><div><span style="font-size:12.8000001907349px"><br></span></div><div><span style="font-size:12.8000001907349px">></span><span style="font-size:12.8000001907349px"> </span><span style="font-size:12.8000001907349px">I worry about setting a default config value to handle a very unusual use case.</span></div><div><span style="font-size:12.8000001907349px"><br></span></div><div><span style="font-size:12.8000001907349px">Changing a static lease is something that works on normal networks, so I don't think we should break it in Neutron without a really good reason.</span></div><div><span style="font-size:12.8000001907349px"><br></span></div><div><span style="font-size:12.8000001907349px">Right now, the big reason to keep a high lease time that I agree with is that it buys operators lots of dnsmasq downtime without affecting running clients. To get the best of both worlds </span>we can set DHCP option 58 (a.k.a. dhcp-renewal-time or T1) to 240 seconds. Then the lease time can be left at something large, like 10 days, to allow for tons of DHCP server downtime without affecting running clients.</div><div><br></div><div>There are two issues with this approach. First, some simple DHCP clients don't honor that DHCP option (e.g. the one with CirrOS), but it works with dhclient so it should work on CentOS, Fedora, etc. (I verified it works on Ubuntu). This isn't a big deal because the worst case is what we have already (half of the lease time). 
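<div><br></div><div>A rough sketch of the timing described above, assuming the RFC 2131 behavior that a client defaults T1 to half of the lease duration when option 58 is absent or ignored. The function name is hypothetical; the 10-day lease and 240-second T1 are the values suggested above.</div>
<pre>
```python
def first_renewal_seconds(lease_seconds, t1_seconds=None):
    """Return how long after lease start a client first tries to renew.

    t1_seconds models DHCP option 58; None means the client did not
    receive it or ignores it, so the RFC 2131 default (half the lease)
    applies.
    """
    if t1_seconds is None:
        return lease_seconds // 2
    return t1_seconds


LEASE = 10 * 24 * 3600  # 10-day lease, as suggested above

# Client that honors option 58 (e.g. dhclient)
print(first_renewal_seconds(LEASE, t1_seconds=240))
# Client that ignores option 58: falls back to half the lease
print(first_renewal_seconds(LEASE))
```
</pre>
<div>With the current 86400-second default and no option 58, the same fallback gives 43200 seconds, i.e. the 12-hour worst case.</div><div><br></div>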
The second issue is that dnsmasq hardcodes that option, so a patch would be required to allow it to be specified in the options file. I am happy to submit the patch required there so that isn't a big deal either.</div><div><br></div><div><br></div><div><div><span style="font-size:12.8000001907349px">If we implement that fix, the remaining issue is Brian's other comment about too much DHCP traffic. </span><span style="font-size:12.8000001907349px">I've been doing some packet captures and the standard request/reply for a renewal is 2 unicast packets totaling about 725 bytes. Assuming 10,000 VMs renewing every 240 seconds, there will be an average of 242 kbps background traffic across the entire network. Even at a density of 50 VMs, that's only 1.2 kbps per</span><span style="font-size:12.8000001907349px"> compute node. If that's still too much, then the deployer can adjust the value upwards, but that's hardly a reason to have a high default.</span></div></div><div><span style="font-size:12.8000001907349px"><br></span></div><div><span style="font-size:12.8000001907349px">That just leaves the logging problem. Since we require a change to dnsmasq anyway, perhaps we could also request an option to suppress logs from renewals? If that's not adequate, I think 2 log entries per vm every 240 seconds is really only a concern for operators with large clouds and they should have the knowledge required to change a config file anyway. ;-)</span><br></div><div><span style="font-size:12.8000001907349px"><br></span></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jan 28, 2015 at 3:59 PM, Chuck Carlino <span dir="ltr"><<a href="mailto:chuckjcarlino@gmail.com" target="_blank">chuckjcarlino@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF"><span class="">
<div>On 01/28/2015 12:51 PM, Kevin Benton
wrote:<br>
</div>
<blockquote type="cite">
<p dir="ltr">If we are going to ignore the IP address changing
use-case, can we just make the default infinity? Then nobody
ever has to worry about control plane outages for existing
clients. 24 hours is way too long to be useful anyway. </p>
</blockquote>
<br></span>
Why would users want to change an active port's IP address anyway?
I can see possible use in changing an inactive port's IP address,
but that wouldn't cause the dhcp issues mentioned here. I worry
about setting a default config value to handle a very unusual use
case.<span class="HOEnZb"><font color="#888888"><br>
<br>
Chuck</font></span><div><div class="h5"><br>
<br>
<br>
<blockquote type="cite">
<div class="gmail_quote">On Jan 28, 2015 12:44 PM, "Salvatore
Orlando" <<a href="mailto:sorlando@nicira.com" target="_blank">sorlando@nicira.com</a>>
wrote:<br type="attribution">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr"><br>
<div class="gmail_extra"><br>
<div class="gmail_quote">On 28 January 2015 at 20:19,
Brian Haley <span dir="ltr"><<a href="mailto:brian.haley@hp.com" target="_blank">brian.haley@hp.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Hi
Kevin,<br>
<span><br>
On 01/28/2015 03:50 AM, Kevin Benton wrote:<br>
> Hi,<br>
><br>
> Approximately a year and a half ago, the
default DHCP lease time in Neutron was<br>
> increased from 120 seconds to 86400 seconds.[1]
This was done with the goal of<br>
> reducing DHCP traffic with very little
discussion (based on what I can see in<br>
> the review and bug report). While it does
indeed reduce DHCP traffic, I don't<br>
> think any bug reports were filed showing that a
120 second lease time resulted<br>
> in too much traffic or that a jump all of the
way to 86400 seconds was required<br>
> instead of a value in the same order of
magnitude.<br>
><br>
> Why does this matter?<br>
><br>
> Neutron ports can be updated with a new IP
address from the same subnet or<br>
> another subnet on the same network. The port
update will result in anti-spoofing<br>
> iptables rule changes that immediately stop the
old IP address from working on<br>
> the host. This means the host is unreachable
for 0-12 hours based on the current<br>
> default lease time without manual
intervention[2] (assuming half-lease length<br>
> DHCP renewal attempts).<br>
<br>
</span>So I'll first comment on the problem. You're
essentially "pulling the rug" out<br>
from under these VMs by changing their IP (and that of
their router and DHCP/DNS<br>
server), but you expect they should fail quickly and
come right back online. In<br>
a non-Neutron environment wouldn't the IT person that
did this need some pretty<br>
good heat-resistant pants for all the flames from
pissed-off users? Sure, the<br>
guy on his laptop will just bounce the connection, but
servers (aka VMs) should<br>
stay pretty static. VMs are servers (and cows
according to some).<br>
</blockquote>
<div><br>
</div>
<div>I actually expect this kind of operation to not be one
Neutron users will do very often, mostly because
regardless of whether you're in the cloud or not,
you'd still need to wear those heat-resistant pants.</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><br>
The correct solution is to be able to renumber the
network so there is no issue<br>
with the anti-spoofing rules dropping packets, or the
VMs having an unreachable<br>
IP address, but that's a much bigger nut to crack.<br>
</blockquote>
<div><br>
</div>
<div>Indeed. In my opinion the "update IP" operation
sets false expectations in users. I have considered
disallowing PUT on fixed_ips in the past but that did
not go ahead because there were users leveraging it.</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span><br>
> Why is this on the mailing list?<br>
><br>
> In an attempt to make the VMs usable in a much
shorter timeframe following a<br>
> Neutron port address change, I submitted a
patch to reduce the default DHCP<br>
> lease time to 8 minutes.[3] However, this was
upsetting to several people,[4] so<br>
> it was suggested I bring this discussion to the
mailing list. The following are<br>
> the high-level concerns followed by my
responses:<br>
><br>
</span>> * 8 minutes is arbitrary<br>
> o Yes, but it's no more arbitrary than 1440
minutes. I picked it as an<br>
<span>> interval because it is still 4
times larger than the last short value,<br>
> but it still allows VMs to regain
connectivity in <5 minutes in the<br>
> event their IP is changed. If someone
has a good suggestion for another<br>
> interval based on known dnsmasq QPS
limits or some other quantitative<br>
> reason, please chime in here.<br>
<br>
</span>We run 48 hours as the default in our public
cloud, and I did some digging to<br>
remind myself of the multiple reasons:<br>
<br>
1. Too much DHCP traffic. Sure, only that initial
request is broadcast, but<br>
dnsmasq is very verbose and loves writing to syslog
for everything it does -<br>
less is more. Do a scale test with 10K VMs and you'll
quickly find out a large<br>
portion of traffic is DHCP RENEWs, and syslog is huge.<br>
</blockquote>
<div><br>
</div>
<div>This is correct, and something I overlooked in my
previous post. Nevertheless I still think that it is
really impossible to find an optimal default which is
regarded as such by every user. The current default
has been chosen mostly for the reason you explain
below, and I don't see a strong reason for changing
it.</div>
<div> <br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><br>
2. During a control-plane upgrade or outage, having a
short DHCP lease time will<br>
take all your VMs offline. The old value of 2 minutes
is not a realistic value<br>
for an upgrade, and I don't think 8 minutes is much
better. Yes, when DHCP is<br>
down you can't boot a new VM, but as long as customers
can get to their existing<br>
VMs they're pretty happy and won't scream bloody
murder.<br>
</blockquote>
<div><br>
</div>
<div>In our cloud we were continuously hit by this. We
could not take our dhcp agents out, otherwise all VMs
would lose their leases, unless the downtime of the
agent was very brief. </div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><br>
There's probably more, but those were the top two,
with #2 being most important.<br>
</blockquote>
<div><br>
</div>
<div>Summarizing, I think that Kevin is exposing a real,
albeit well-known, problem (sorry about my dhcp release
faux pas - I can use jet lag as a justification!), and
he's proposing a mitigation to it. On the other hand,
this mitigation, as Brian explains, is going to cause
real operational issues. Still, we're arguing over the
default value for a configuration parameter. I
therefore think the best thing that we can do is
explicitly state what happens when setting long or
short lease times.</div>
<div>I expected this to be documented in [1], but it's
not. I think that place and neutron.conf might contain
this kind of documentation, such as:</div>
<div><br>
</div>
<div>
<div><font face="monospace, monospace"># DHCP Lease
duration (in seconds). </font></div>
<div><font face="monospace, monospace"># Use<span style="white-space:pre-wrap"> -1 to</span> tell
dnsmasq to use infinite lease times. <span style="white-space:pre-wrap"> </span></font></div>
<div><font face="monospace, monospace">#
dhcp_lease_duration = 86400</font></div>
</div>
<div><font face="monospace, monospace"># Note that long
DHCP leases will result in delays</font></div>
<div><font face="monospace, monospace"># in instances
acquiring updated IP addresses. This</font></div>
<div><font face="monospace, monospace"># may result in
downtime for those instances, as</font></div>
<div><font face="monospace, monospace"># anti-spoofing rules
will then block all traffic in and out of</font></div>
<div><font face="monospace, monospace"># them. In order
to minimise this downtime window</font></div>
<div><font face="monospace, monospace"># the lease time
should be shorter, for example</font></div>
<div><font face="monospace, monospace">#
dhcp_lease_duration = 480</font><br>
</div>
<div><br>
</div>
<div>However, I would not change the current system
default, as this might affect operational systems.</div>
<div><br>
</div>
<div>Apologies again for my stupid dhcp-release note,</div>
<div>Salvatore</div>
<div><br>
</div>
<div>[1] <a href="http://developer.openstack.org/api-ref-networking-v2.html" target="_blank">http://developer.openstack.org/api-ref-networking-v2.html</a></div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><br>
> * other datacenters use long lease times<br>
> o This is true, but it's not really a valid
comparison. In most regular<br>
<span>> datacenters, updating a static DHCP
lease has no effect on the data<br>
> plane so it doesn't matter that the
client doesn't react for hours/days<br>
> (even with DHCP snooping enabled).
However, in Neutron's case, the<br>
> security groups are immediately updated
so all traffic using the old<br>
> address is blocked.<br>
<br>
</span>Yes, and choosing the lease time is a
deployment decision that needs to take a<br>
lot of things into account. Like I said, we don't
even use the default. The<br>
default should just be a good guess for a standard
deployment, not a value that<br>
caters towards the edge cases, especially when the
value is tunable in neutron.conf.<br>
<br>
> * dhcp traffic is scary because it's broadcast<br>
> o ARP traffic is also broadcast and many
clients will expire entries every<br>
<span>> 5-10 minutes and re-ARP.
L2population may be used to prevent ARP<br>
> propagation, so the comparison between
DHCP and ARP isn't always<br>
> relevant here.<br>
<br>
</span>I don't recall anyone being scared of
broadcast, and can't find any comments<br>
regarding it in <a href="https://review.openstack.org/#/c/150595/" target="_blank">https://review.openstack.org/#/c/150595/</a><br>
<span><br>
> Please reply back with your
opinions/anecdotes/data related to short DHCP lease<br>
> times.<br>
<br>
</span>I can only speculate on why 24 hours was chosen
as the default back in 2013,<br>
possibly because a lot of wireless router firmware
defaults are set as such?<br>
<span><br>
> 1. <a href="https://github.com/openstack/neutron/commit/d9832282cf656b162c51afdefb830dacab72defe" target="_blank">https://github.com/openstack/neutron/commit/d9832282cf656b162c51afdefb830dacab72defe</a><br>
> 2. Manual intervention could be an instance
reboot, a dhcp client invocation via<br>
> the console, or a delayed invocation right
before the update. (all significantly<br>
> more difficult to script than a simple update
of a port's IP via the API).<br>
> 3. <a href="https://review.openstack.org/#/c/150595/" target="_blank">https://review.openstack.org/#/c/150595/</a><br>
> 4. <a href="http://i.imgur.com/xtvatkP.jpg" target="_blank">http://i.imgur.com/xtvatkP.jpg</a><br>
<br>
</span>I was a much bigger baby than that :)<br>
<span><font color="#888888"><br>
-Brian<br>
</font></span>
<div>
<div><br>
__________________________________________________________________________<br>
OpenStack Development Mailing List (not for usage
questions)<br>
Unsubscribe: <a href="mailto:OpenStack-dev-request@lists.openstack.org?subject:unsubscribe" target="_blank">OpenStack-dev-request@lists.openstack.org?subject:unsubscribe</a><br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
<br>
<br>
</blockquote>
</div>
<br>
</blockquote>
<br>
</div></div></div>
<br></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature"><div>Kevin Benton</div></div>
</div>