Re: [stein][neutron] gratuitous arp

29 Apr 2020

Hello Sean,
the following is the configuration on my compute nodes:
[root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep libvirt
libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7.x86_64
libvirt-daemon-kvm-4.5.0-33.el7.x86_64
libvirt-libs-4.5.0-33.el7.x86_64
libvirt-daemon-driver-network-4.5.0-33.el7.x86_64
libvirt-daemon-driver-nodedev-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-33.el7.x86_64
libvirt-client-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-core-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-logical-4.5.0-33.el7.x86_64
libvirt-daemon-driver-secret-4.5.0-33.el7.x86_64
libvirt-daemon-4.5.0-33.el7.x86_64
libvirt-daemon-driver-nwfilter-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-scsi-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-rbd-4.5.0-33.el7.x86_64
libvirt-daemon-config-nwfilter-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-disk-4.5.0-33.el7.x86_64
libvirt-bash-completion-4.5.0-33.el7.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-4.5.0-33.el7.x86_64
libvirt-python-4.5.0-1.el7.x86_64
libvirt-daemon-driver-interface-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-mpath-4.5.0-33.el7.x86_64
[root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep qemu
qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64
qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
centos-release-qemu-ev-1.0-4.el7.centos.noarch
ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch
qemu-img-ev-2.12.0-44.1.el7_8.1.x86_64

As for the firewall driver, /etc/neutron/plugins/ml2/openvswitch_agent.ini has:

firewall_driver = iptables_hybrid

I have the same libvirt/qemu versions and the same firewall driver on my
queens, rocky and stein testing environments.
Live migration on a provider network works fine on queens.
It does not work on rocky and stein: the vm loses connectivity after it is
migrated and starts to respond only when the vm itself sends a network
packet, for example when chrony polls the time server.

Ignazio

On Wed, 29 Apr 2020 at 14:36, Sean Mooney <smooney@redhat.com> wrote:
...
On Wed, 2020-04-29 at 10:39 +0200, Ignazio Cassano wrote:
...
Hello, some updates about this issue.
I read that someone has the same issue, as reported here:
https://bugs.launchpad.net/neutron/+bug/1866139
If you read the discussion, someone says that the garp must be sent by
qemu during live migration.
If this is true, it means that on rocky/stein qemu/libvirt are bugged.
that is not correct.
qemu/libvirt has always used RARP, which predates GARP, to serve as its mac
learning frames instead:
https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol
https://lists.gnu.org/archive/html/qemu-devel/2009-10/msg01457.html
however it looks like this was broken in 2016 in qemu 2.6.0
https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04645.html
but was fixed by
https://github.com/qemu/qemu/commit/ca1ee3d6b546e841a1b9db413eb8fa09f13a061b
can you confirm you are not using the broken 2.6.0 release and are using
2.7 or newer or 2.4 and older?
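
One way to answer the version question from the `rpm -qa` output is sketched below. It is only an illustration: the helper names and the regular expression are hypothetical, and treating the whole 2.6 series as suspect is an assumption. The thread itself only names the 2.6.0 release as broken.

```python
import re

# Hypothetical helpers; only "2.6.0 is broken" comes from the thread.
_VERSION_RE = re.compile(r"^qemu-kvm(?:-ev)?-(\d+)\.(\d+)\.(\d+)-")


def qemu_version(package):
    """Return (major, minor, patch) parsed from an RPM name, or None."""
    match = _VERSION_RE.match(package)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def in_broken_series(version):
    """True for the 2.6 series, treated here (as an assumption) as suspect."""
    return version[:2] == (2, 6)


version = qemu_version("qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64")
print(version, in_broken_series(version))
```

With the package listed earlier in this message the check reports a 2.12.0 build, which is outside the series in question.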
...
So I tried to use stein and rocky with the same versions of the libvirt/qemu
packages I installed on queens (I updated the compute and controller nodes
on queens to obtain the same libvirt/qemu versions deployed on rocky and
stein).
On queens, live migration on a provider network continues to work fine.
On rocky and stein it does not, so I think the issue is related to the
openstack components.
on queens we have only a single port binding, and nova blindly assumes that
the port binding details wont change when it does a live migration and does
not update the xml for the network interfaces.
the port binding is updated after the migration is complete, in
post_livemigration.
in rocky+ neutron optionally uses the multiple port bindings flow to
prebind the port to the destination so it can update the xml if needed, and
if post copy live migration is enabled it will asynchronously activate the
dest port binding before post_livemigration, shortening the downtime.
if you are using the iptables firewall, os-vif will have precreated the ovs
port and intermediate linux bridge before the migration started, which will
allow neutron to wire it up (put it on the correct vlan and install security
groups) before the vm completes the migration.
if you are using the ovs firewall, os-vif still precreates the ovs port but
libvirt deletes it and recreates it too.
as a result there is a race when using the openvswitch firewall that can
result in the RARP packets being lost.
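
For readers who want to see what such a mac learning frame looks like on the wire, here is a minimal sketch that builds a broadcast RARP frame with the standard library. The layout follows the generic RARP format (EtherType 0x8035, "request reverse" opcode 3); the function name and the example MAC address are hypothetical, and the exact bytes qemu emits are an assumption, not something stated in this thread.

```python
import struct

ETH_P_RARP = 0x8035            # EtherType assigned to RARP
RARP_OP_REQUEST_REVERSE = 3    # "request reverse" opcode


def build_rarp_announce(mac):
    """Build a 60-byte broadcast RARP frame announcing `mac` (hypothetical helper)."""
    hw = bytes.fromhex(mac.replace(":", ""))
    if len(hw) != 6:
        raise ValueError("expected a 6-byte MAC address")
    # Ethernet header: broadcast destination, guest MAC as source, RARP type.
    ethernet = b"\xff" * 6 + hw + struct.pack("!H", ETH_P_RARP)
    # Fixed header: Ethernet hardware type, IPv4 protocol type, address sizes, opcode.
    rarp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, RARP_OP_REQUEST_REVERSE)
    # Sender and target hardware address are both the guest MAC; IPs are zero.
    rarp += hw + b"\x00" * 4 + hw + b"\x00" * 4
    # Pad to the minimum Ethernet frame size (without the FCS).
    return (ethernet + rarp).ljust(60, b"\x00")


frame = build_rarp_announce("fa:16:3e:00:00:01")  # placeholder MAC
print(len(frame), frame[12:14].hex())
```

A switch only needs the source MAC of this frame to relearn where the vm lives, which is why losing it during the port re-plug leaves the vm unreachable until it sends traffic of its own.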
...
Best Regards
Ignazio Cassano
On Mon, 27 Apr 2020 at 19:50, Sean Mooney <smooney@redhat.com> wrote:
...
On Mon, 2020-04-27 at 18:19 +0200, Ignazio Cassano wrote:
...
Hello, I have this problem with rocky or newer with the iptables_hybrid
firewall.
So, can I solve it by using post copy live migration?
so this behavior has always been how nova worked, but in rocky the
https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neu...
spec introduced the ability to shorten the outage by pre-binding the port and
activating it when the vm is resumed on the destination host, before we get
to post live migrate.
this reduces the outage time, although it cant be fully eliminated as some
level of packet loss is always expected when you live migrate.
so yes, enabling post copy live migration should help, but be aware that if
a network partition happens during a post copy live migration the vm will
crash and need to be restarted.
it is generally safe to use and will improve the migration performance, but
unlike pre copy migration, if the guest resumes on the dest and a memory page
has not been copied yet then it must wait for it to be copied and retrieve it
from the source host. if the connection to the source host is interrupted
then the vm cant do that, the migration will fail and the instance will crash.
if you are using pre copy migration and there is a network partition during
the migration, the migration will fail but the instance will continue to run
on the source host.
so while i would still recommend using it, it is just good to be aware of
that behavior change.
...
Thanks
Ignazio
On Mon, 27 Apr 2020 at 17:57, Sean Mooney <smooney@redhat.com> wrote:
...
On Mon, 2020-04-27 at 17:06 +0200, Ignazio Cassano wrote:
...
Hello, I have a problem on stein neutron. When a vm migrates from one node
to another I cannot ping it for several minutes. If in the vm I put a
script that pings the gateway continuously, the live migration works fine
and I can ping it. Why does this happen? I read something about gratuitous
arp.
...
qemu does not use gratuitous arp but instead uses an older protocol called
RARP to do mac address learning.
what release of openstack are you using, and are you using the iptables
firewall or the openvswitch firewall?
if you are using openvswitch there is nothing we can do until we finally
delegate vif plugging to os-vif.
currently libvirt handles interface plugging for kernel ovs when using the
openvswitch firewall driver.
https://review.opendev.org/#/c/602432/ would address that, but it and the
neutron patch https://review.opendev.org/#/c/640258 are rather outdated.
while libvirt is plugging the vif there will always be a race condition where
the RARP packets sent by qemu, and hence the mac learning packets, will be
lost.
if you are using the iptables firewall and you have openstack rocky or later,
then if you enable post copy live migration it should reduce the downtime.
in this configuration we do not have the race between neutron and libvirt, so
the rarp packets should not be lost.
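
To check whether post copy is actually permitted on a compute node, something along these lines could be used. It is a sketch under assumptions: the sample text stands in for the node's nova.conf, the function name is hypothetical, and the option is taken to be `live_migration_permit_post_copy` in the `[libvirt]` section, with a missing option treated as disabled.

```python
import configparser

# Sample text standing in for nova.conf on a compute node (illustrative only).
SAMPLE_NOVA_CONF = """
[libvirt]
live_migration_permit_post_copy = True
"""


def post_copy_permitted(conf_text):
    """Return True if [libvirt]/live_migration_permit_post_copy is enabled."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    # Assumption: an absent section or option means post copy is not permitted.
    return parser.getboolean(
        "libvirt", "live_migration_permit_post_copy", fallback=False
    )


print(post_copy_permitted(SAMPLE_NOVA_CONF))
```

Keep in mind the trade-off described in this thread: with post copy enabled, a network partition during the migration can crash the instance instead of leaving it running on the source host.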
...
Please help me.
Is there any workaround?
Best Regards
Ignazio