[openstack-dev] [neutron] VXLAN with single-NIC compute nodes: Avoiding the MTU pitfalls

Fredy Neeser Fredy.Neeser at solnet.ch
Thu Mar 12 12:33:18 UTC 2015


On 11.03.2015 19:31, Ian Wells wrote:
> On 11 March 2015 at 04:27, Fredy Neeser <Fredy.Neeser at solnet.ch> wrote:
>
>     7: br-ex.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
>         link/ether e0:3f:49:b4:7c:a7 brd ff:ff:ff:ff:ff:ff
>         inet 192.168.1.14/24 brd 192.168.1.255 scope global br-ex.1
>            valid_lft forever preferred_lft forever
>
>     8: br-ex.12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1554 qdisc noqueue state UNKNOWN group default
>         link/ether e0:3f:49:b4:7c:a7 brd ff:ff:ff:ff:ff:ff
>         inet 192.168.1.14/24 brd 192.168.1.255 scope global br-ex.12
>            valid_lft forever preferred_lft forever
>
>
> I find it hard to believe that you want the same address configured on 
> *both* of these interfaces - which one do you think will be sending 
> packets?

Ian, thanks for your feedback!

I deliberately chose the same address for the two interfaces, for three reasons:

1.  Within my single-LAN home (underlay) environment, traffic is 
switched, and VXLAN traffic is confined to VLAN 12, so IP 192.168.1.14 
on VLAN 1 never conflicts with the same IP on VLAN 12.
OTOH, for a more scalable VXLAN setup (with multiple underlays and L3 
routing in between), I would like to use different IPs for br-ex.1 and 
br-ex.12 -- for example, separate subnets
   192.168.1.0/26   for VLAN 1
   192.168.12.0/26  for VLAN 12
However, I'm not quite there yet (see 3.).

2.  I'm using policy routing on my hosts to steer VXLAN traffic (UDP 
dest. port 4789) to interface br-ex.12.  All other traffic from 
192.168.1.14 is sent from br-ex.1, presumably because br-ex.1 is a 
lower-numbered interface than br-ex.12 (?) -- an interesting question 
is whether I'm relying here on the order in which I created these two 
interfaces.

   [root at langrain ~]# ip a
   ...
   7: br-ex.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
       link/ether e0:3f:49:b4:7c:a7 brd ff:ff:ff:ff:ff:ff
       inet 192.168.1.14/24 brd 192.168.1.255 scope global br-ex.1
          valid_lft forever preferred_lft forever
   8: br-ex.12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1554 qdisc noqueue state UNKNOWN group default
       link/ether e0:3f:49:b4:7c:a7 brd ff:ff:ff:ff:ff:ff
       inet 192.168.1.14/24 brd 192.168.1.255 scope global br-ex.12
          valid_lft forever preferred_lft forever
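For the record, such steering can be sketched with a fwmark plus a 
dedicated routing table (the mark value 0x12 and table number 12 are 
arbitrary local choices, not something OpenStack sets up):

```shell
# Mark locally generated VXLAN traffic (UDP dest. port 4789)
iptables -t mangle -A OUTPUT -p udp --dport 4789 -j MARK --set-mark 0x12

# Dedicated routing table whose routes point out of br-ex.12
ip route add 192.168.1.0/24 dev br-ex.12 src 192.168.1.14 table 12

# Send marked packets to that table
ip rule add fwmark 0x12 table 12
```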

3.  It's not clear to me how to set up multiple nodes with packstack if 
a node's tunnel IP differs from its admin IP (or from the OpenStack API 
IP in the case of a controller node).  With packstack, I can only 
specify the compute node IPs through CONFIG_COMPUTE_HOSTS.  Presumably, 
these IPs are used both for packstack deployment (admin IP) and for 
configuring the VXLAN tunnel IPs (local_ip and remote_ip parameters).  
How would I specify different IPs for these purposes?  (Recall that my 
hosts have a single NIC.)
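One workaround I'm considering (untested sketch; 192.168.12.14 is the 
hypothetical VLAN-12 address from point 1): let packstack deploy using 
the admin IPs, then override the tunnel IP in the OVS agent 
configuration on each node and restart the agent:

```ini
# /etc/neutron/plugins/ml2/ml2_conf.ini (OVS agent section), per node
[ovs]
local_ip = 192.168.12.14
```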


In any case, native traffic on bridge br-ex is sent via br-ex.1 (VLAN 
1), which is also why the Neutron gateway port qg-XXX needs to be an 
access port for VLAN 1 (tag: 1).  VXLAN traffic is sent from br-ex.12 
on all compute nodes.  See the two cases below:


Case 1: Max-size ping from compute node 'langrain' (192.168.1.14) to 
another host on the same LAN
         => Native traffic sent from br-ex.1; no traffic sent from br-ex.12

[fn at langrain ~]$ ping -M do -s 1472 -c 1 192.168.1.54
PING 192.168.1.54 (192.168.1.54) 1472(1500) bytes of data.
1480 bytes from 192.168.1.54: icmp_seq=1 ttl=64 time=0.766 ms

[root at langrain ~]# tcpdump -n -i br-ex.1 dst 192.168.1.54
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-ex.1, link-type EN10MB (Ethernet), capture size 65535 bytes
10:32:37.666572 IP 192.168.1.14 > 192.168.1.54: ICMP echo request, id 10432, seq 1, length 1480
10:32:42.673665 ARP, Request who-has 192.168.1.54 tell 192.168.1.14, length 28


Case 2: Max-size ping from guest1 (10.0.0.1) on compute node 'langrain' 
(192.168.1.14)
         to guest2 (10.0.0.3) on another compute node (192.168.1.21) 
via VXLAN tunnel.
         Guests are on the same virtual network 10.0.0.0/24
         => Encapsulated traffic sent from br-ex.12; no traffic sent from br-ex.1

$ ping -M do -s 1472 -c 1 10.0.0.3
PING 10.0.0.3 (10.0.0.3) 1472(1500) bytes of data.
1480 bytes from 10.0.0.3: icmp_seq=1 ttl=64 time=2.22 ms

[root at langrain ~]# tcpdump -n -i br-ex.12
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-ex.12, link-type EN10MB (Ethernet), capture size 65535 bytes

11:02:56.916265 IP 192.168.1.14.47872 > 192.168.1.21.4789: VXLAN, flags [I] (0x08), vni 10
ARP, Request who-has 10.0.0.3 tell 10.0.0.1, length 28
11:02:56.916991 IP 192.168.1.21.51408 > 192.168.1.14.4789: VXLAN, flags [I] (0x08), vni 10
ARP, Reply 10.0.0.3 is-at fa:16:3e:e6:e1:c8, length 28
11:02:56.917282 IP 192.168.1.14.57836 > 192.168.1.21.4789: VXLAN, flags [I] (0x08), vni 10
IP 10.0.0.1 > 10.0.0.3: ICMP echo request, id 25474, seq 1, length 1480
11:02:56.918110 IP 192.168.1.21.44153 > 192.168.1.14.4789: VXLAN, flags [I] (0x08), vni 10
IP 10.0.0.3 > 10.0.0.1: ICMP echo reply, id 25474, seq 1, length 1480
11:03:01.918885 IP 192.168.1.21.51408 > 192.168.1.14.4789: VXLAN, flags [I] (0x08), vni 10
ARP, Request who-has 10.0.0.1 tell 10.0.0.3, length 28
11:03:01.919207 IP 192.168.1.14.57760 > 192.168.1.21.4789: VXLAN, flags [I] (0x08), vni 10
ARP, Reply 10.0.0.1 is-at fa:16:3e:f4:1d:89, length 28
11:03:01.920502 ARP, Request who-has 192.168.1.14 tell 192.168.1.21, length 46
11:03:01.920519 ARP, Reply 192.168.1.14 is-at e0:3f:49:b4:7c:a7, length 28
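As a sanity check on the MTU numbers: a 1500-byte inner IP packet picks 
up the inner Ethernet header, room for an inner 802.1Q tag, and the 
VXLAN/UDP/IP encapsulation on the underlay, which is why br-ex.12 
carries an MTU of 1554 while br-ex.1 stays at 1500 (assuming the extra 
4 bytes are there to allow for a tagged inner frame):

```python
# VXLAN encapsulation overhead, counted against the underlay interface
# MTU (the outer Ethernet header does not count toward the MTU).
inner_eth  = 14   # inner Ethernet header
inner_vlan = 4    # inner 802.1Q tag, if the inner frame is tagged
vxlan_hdr  = 8    # VXLAN header
outer_udp  = 8    # outer UDP header
outer_ip   = 20   # outer IPv4 header

overhead = inner_eth + inner_vlan + vxlan_hdr + outer_udp + outer_ip
print(overhead)          # 54
print(1500 + overhead)   # 1554 -- the MTU configured on br-ex.12
```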


> You may find that configuring a VLAN interface for eth1.12 (not in a 
> bridge, with a local address suitable for communication with compute 
> nodes, for VXLAN traffic) and eth1.1 (in br-ex, for external traffic 
> to use) does better for you.
Hmm, I only have one NIC (eth0).  In order to attach eth0 to br-ex, I 
had to configure it as an OVSPort.
Maybe I misunderstand your alternative, but are you suggesting to 
configure eth0.1 as an OVSPort (connected to br-ex) and eth0.12 as a 
standalone interface?  (I'm not sure a physical interface can be "brain 
split" in such a way.)

> I'm also not clear what your Openstack API endpoint address or MTU is 
> - maybe that's why the eth1.1 interface is addressed?
It's 192.168.1.14, and br-ex.1 is always used for native traffic, so the 
MTU is 1500.

Note that my physical switch uses a native VLAN of 1 and is configured 
with "Untag all ports" for VLAN 1.  Moreover, OVSPort eth0 (attached to 
br-ex) is configured for VLAN trunking with a native VLAN of 1 
(vlan_mode: native-untagged, trunks: [1,12], tag: 1), so within bridge 
br-ex, native packets are tagged 1.
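For completeness, that OVS port configuration corresponds roughly to 
the following (sketch, as a single ovs-vsctl invocation):

```shell
# Trunk VLANs 1 and 12 on OVSPort eth0, with VLAN 1 as the native
# (untagged) VLAN, so untagged frames entering br-ex are tagged 1.
ovs-vsctl set port eth0 vlan_mode=native-untagged tag=1 trunks=1,12
```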

>   I can tell you that if you want your API to be on the same address 
> 192.168.1.14 as the VXLAN tunnel endpoints then it has to be one 
> address on one interface and the two functions will share the same MTU 
> - almost certainly not what you're looking for.
With my current setup (thanks to policy routing), I have the same IP on 
two interfaces br-ex.1 and br-ex.12, with MTUs 1500 and 1554, respectively.

>   If you source VXLAN packets from a different IP address then you can 
> put it on a different interface and give it a different MTU - which 
> appears to fit what you want much better.
Selecting different compute host IPs for admin (CONFIG_COMPUTE_HOSTS) 
and tunnel traffic would eliminate the need for policy routing and 
would also be more suitable for scaling a VXLAN deployment across 
multiple independent L2 broadcast domains, but for that I'll need to 
resolve point 3. above -- pointers in that direction are much appreciated.

Thanks,
- Fredy
