[openstack-dev] [TripleO][Neutron] PMTUd broken in gre networks

Robert Collins robertc at robertcollins.net
Wed Jan 22 11:01:10 UTC 2014


On 22 January 2014 21:28, Ian Wells <ijw.ubuntu at cack.org.uk> wrote:
> On 22 January 2014 00:00, Robert Collins <robertc at robertcollins.net> wrote:
>>
>> I think dropping frames that can't be forwarded is entirely sane - at a
>> guess it's what a physical ethernet switch would do if you try to
>> send a 1600 byte frame (on a non-jumbo-frame switched network) - but
>> perhaps there is an actual standard for this we could follow?
>
>
> Speaking from bitter experience, if you've misconfigured your switch so that
> it's dropping packets for this reason, you will have a period of hair
> tearing out to solve the problem before you work it out.  Believe me, been
> there, rabbit messages that don't turn up because they're the first ones
> that were too big are not a helpful diagnostic indicator.

PMTU blackhole problems show the same symptoms :) - been there, done that.

> Getting the MTU *right* on all hosts seems to be key to keeping your hair
> attached to your head for a little longer.  Hence the DHCP suggestion to set
> it to the right value.

I certainly think having the MTU set to the right value is important.
I wonder if there's a standard way we can signal the MTU (e.g. in the
virtio interface) other than DHCP. Not because DHCP is bad, but
because that would work with statically injected network configs as
well.

>> > (c) we require Neutron plugins to work out the MTU, which for
>> > any encap except VLAN is (host interface MTU - header size).
>>
>> do you mean tunnel wrap overheads? (What if a particular tunnel has a
>> trailer.. crazy talk I know).
>
>
> Yup, basically.  Unfortunately, thinking about this a bit more, you can't
> easily be certain what the max packet size allowed in a GRE tunnel is going
> to be, because you don't know which interface it's going over (or what's
> between), but to a certain extent we can use config items to fix what we
> can't discover.

One thing we could do is encourage OS vendors to enable
/proc/sys/net/ipv4/tcp_mtu_probing (packetization layer PMTU
discovery, http://www.ietf.org/rfc/rfc4821.txt) in combination with
dropping over-size frames. TCP connections should then detect the
actual MTU on their own.
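
To check what a given guest is currently doing, something like the
sketch below could help. The 0/1/2 meanings are from the kernel's
ip-sysctl documentation; the function names are made up for this
example, and it only reads the setting, it does not change it.

```python
from pathlib import Path

# Real sysctl file on Linux; absent on other systems.
SYSCTL_PATH = Path("/proc/sys/net/ipv4/tcp_mtu_probing")

# Meanings as described in the kernel's ip-sysctl documentation.
MODES = {
    0: "disabled",
    1: "enabled only after an ICMP black hole is detected",
    2: "always enabled",
}


def describe_mtu_probing(raw):
    """Translate the raw sysctl text (e.g. "1\\n") into a description."""
    value = int(raw.strip())
    if value not in MODES:
        raise ValueError("unexpected tcp_mtu_probing value: %d" % value)
    return MODES[value]


def current_mtu_probing(path=SYSCTL_PATH):
    """Return the description for this host, or None if unreadable."""
    try:
        return describe_mtu_probing(path.read_text())
    except OSError:
        return None
```

Calling current_mtu_probing() inside an instance would show whether
probing is on there; None means the file could not be read.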

Another thing would be for encapsulation failures in the switch to be
reflected in the vNIC in the instance - export back media errors (e.g.
babbles) so that users can diagnose problems.

Note that IPv6 doesn't *have* a DF bit, because routers are not
permitted to fragment - arguably encapsulating an IPv6 frame in GRE
and then fragmenting the outer layer is a violation of that.

As for automatically determining the size - we can determine the PMTU
between all hosts in the mesh, report those back centrally, take the
lowest, and then subtract the GRE overhead.
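
As a rough sketch of that calculation (not anything Neutron does
today): the overhead figures below assume an IPv4 outer header, a GRE
header carrying a key, and an encapsulated Ethernet header, so adjust
them for the encapsulation actually in use. The function names and the
shape of the report data are invented for the example.

```python
# Assumed per-packet overheads in bytes; these are illustrative and
# depend on the real tunnel configuration.
OUTER_IPV4_HEADER = 20
GRE_BASE_HEADER = 4
GRE_KEY_FIELD = 4
INNER_ETHERNET_HEADER = 14


def gre_overhead(with_key=True):
    """Bytes added around the instance's IP packet by the tunnel."""
    overhead = OUTER_IPV4_HEADER + GRE_BASE_HEADER + INNER_ETHERNET_HEADER
    if with_key:
        overhead += GRE_KEY_FIELD
    return overhead


def instance_mtu(pmtu_reports, overhead):
    """Lowest reported PMTU across the mesh, minus the tunnel overhead.

    pmtu_reports maps a (source host, destination host) pair to the
    path MTU measured between them, in bytes.
    """
    if not pmtu_reports:
        raise ValueError("no PMTU reports collected")
    mtu = min(pmtu_reports.values()) - overhead
    if mtu <= 0:
        raise ValueError("overhead exceeds the lowest reported PMTU")
    return mtu


reports = {
    ("compute1", "compute2"): 1500,
    ("compute1", "network1"): 1476,
    ("compute2", "network1"): 1500,
}
mesh_mtu = instance_mtu(reports, gre_overhead())
```

With those made-up reports the lowest path is 1476 bytes, so the
instances would be told 1476 - 42 = 1434.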

-Rob


-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Converged Cloud


