[openstack-dev] [TripleO][Neutron] PMTUd broken in gre networks

Ian Wells ijw.ubuntu at cack.org.uk
Wed Jan 22 11:58:58 UTC 2014


On 22 January 2014 12:01, Robert Collins <robertc at robertcollins.net> wrote:

> > Getting the MTU *right* on all hosts seems to be key to keeping your hair
> > attached to your head for a little longer.  Hence the DHCP suggestion to
> > set it to the right value.
>
> I certainly think having the MTU set to the right value is important.
> I wonder if there's a standard way we can signal the MTU (e.g. in the
> virtio interface) other than DHCP. Not because DHCP is bad, but
> because that would work with statically injected network configs as
> well.
>

To the best of my knowledge, no.  And it wants to be a part of the static
config too.
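
For reference, the DHCP route discussed above is the Interface MTU option
(code 26 in RFC 2132).  A minimal sketch of how that option is encoded on the
wire follows; the function name and the 1454 example value are illustrative
assumptions, not anything Neutron ships.

```python
import struct

# RFC 2132 "Interface MTU" option: code 26, length 2, 16-bit unsigned value.
DHCP_OPT_INTERFACE_MTU = 26
MIN_MTU = 68  # smallest value RFC 2132 allows for this option


def encode_mtu_option(mtu: int) -> bytes:
    """Return the raw option bytes: code, length, big-endian MTU.

    Hypothetical helper for illustration only.
    """
    if not MIN_MTU <= mtu <= 0xFFFF:
        raise ValueError(f"MTU {mtu} does not fit the DHCP option")
    return struct.pack("!BBH", DHCP_OPT_INTERFACE_MTU, 2, mtu)


if __name__ == "__main__":
    # 1454 is only an example of a reduced MTU for a tunnelled network.
    print(encode_mtu_option(1454).hex())
```

The same number would have to reach statically configured guests by some
other route, which is the gap being discussed here.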

<derail>
And the static config, last I checked, also sucks: we really want the data
to be in a metadata format that cloud-init can consume, but last I checked
the feature in config-drive et al. just writes /etc/network/interfaces,
which is no use to anyone on Windows, or Red Hat, or...
</derail>


> One thing we could do is encourage OS vendors to turn
> /proc/sys/net/ipv4/tcp_mtu_probing
> (http://www.ietf.org/rfc/rfc4821.txt) on in combination with dropping
> over-size frames. That should detect the actual MTU.
>

Though it's really a bit of a workaround.
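
For anyone who wants to see what a given Linux guest is doing today, here is
a small sketch that reads the sysctl named above.  The helper names are made
up for illustration; the meaning of the three values is as I understand the
kernel's ip-sysctl documentation.

```python
from pathlib import Path

PROBING_SYSCTL = "/proc/sys/net/ipv4/tcp_mtu_probing"

# Value meanings as described in the kernel's ip-sysctl documentation.
MODES = {
    0: "disabled",
    1: "disabled by default, enabled once an ICMP black hole is detected",
    2: "always enabled",
}


def read_tcp_mtu_probing(path=PROBING_SYSCTL):
    """Return the sysctl value as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def describe_mode(value):
    """Map a sysctl value to a human-readable description."""
    return MODES.get(value, "unknown")


if __name__ == "__main__":
    print("tcp_mtu_probing:", describe_mode(read_tcp_mtu_probing()))
```

This only reports the setting; it does nothing about the frames being
dropped in the first place.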

> Another thing would be for encapsulation failures in the switch to be
> reflected in the vNIC in the instance - export back media errors (e.g.
> babbles) so that users can diagnose problems.
>

Ditto.


> Note that IPv6 doesn't *have* a DF bit, because routers are not
> permitted to fragment - arguably encapsulating an ipv6 frame in GRE
> and then fragmenting the outer layer is a violation of that.
>

Fragmentation is fine for the tunnel, *if* the tunnel also reassembles. The
issue with fragmentation is that it's horrible to implement on all your
endpoints, aiui, and it used to lead to innumerable fragmentation attacks.

> As for automatically determining the size - we can determine the PMTU
> between all hosts in the mesh, report those back centrally and take
> the lowest then subtract the GRE overhead.
>

That works if there's only one path, and if there's no lower MTU somewhere
on the GRE path (which can go via routers)...  We can make an educated guess
at the MTU, but we can't know it without testing each GRE tunnel as we set
it up (and multiple routes defeat even that), so I would recommend a config
option as the best of a nasty set of choices.  It can still go wrong, but
it's then blatantly and obviously a config fault rather than some code
guessing wrong, which would be harder for an end user to work around.
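
To make the arithmetic concrete, here is a rough sketch of "take the lowest
PMTU and subtract the GRE overhead", with an explicit config value winning
over the guess.  The overhead constants assume an IPv4 outer header, a GRE
header carrying a key, and an inner Ethernet frame; those figures and all the
names are my assumptions for illustration, so check them against your own
encapsulation.

```python
# Assumed encapsulation overhead (bytes); adjust for your deployment.
OUTER_IPV4_HEADER = 20
GRE_BASE_HEADER = 4
GRE_KEY_FIELD = 4
INNER_ETHERNET_HEADER = 14
GRE_OVERHEAD = (OUTER_IPV4_HEADER + GRE_BASE_HEADER
                + GRE_KEY_FIELD + INNER_ETHERNET_HEADER)


def guessed_mtu(path_mtus, overhead=GRE_OVERHEAD):
    """Educated guess: lowest measured path MTU minus tunnel overhead."""
    if not path_mtus:
        raise ValueError("need at least one measured path MTU")
    return min(path_mtus) - overhead


def effective_mtu(configured_mtu, path_mtus, overhead=GRE_OVERHEAD):
    """Prefer the operator's configured value; fall back to the guess."""
    if configured_mtu is not None:
        return configured_mtu
    return guessed_mtu(path_mtus, overhead)


if __name__ == "__main__":
    measured = [1500, 9000, 1500]          # hypothetical per-host PMTUs
    print(effective_mtu(None, measured))   # falls back to the guess
    print(effective_mtu(1400, measured))   # operator override wins
```

The guess is only as good as the measurements fed into it, which is why the
override comes first.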
-- 
Ian.