[stein][neutron] gratuitous arp
Hello, I have a problem on Stein Neutron. When a VM migrates from one node to another, I cannot ping it for several minutes. If I put a script in the VM that pings the gateway continuously, the live migration works fine and I can ping it. Why does this happen? I read something about gratuitous ARP. Please help me. Any workaround, please?
Best Regards Ignazio
On Mon, 2020-04-27 at 17:06 +0200, Ignazio Cassano wrote:
qemu does not use gratuitous ARP; instead it uses an older protocol called RARP to do MAC address learning.
What release of OpenStack are you using, and are you using the iptables firewall or the openvswitch firewall?
If you are using openvswitch, there is nothing we can do until we finally delegate VIF plugging to os-vif. Currently libvirt handles interface plugging for kernel OVS when using the openvswitch firewall driver. https://review.opendev.org/#/c/602432/ would address that, but it and the neutron patch https://review.opendev.org/#/c/640258 are rather outdated. While libvirt is plugging the VIF there will always be a race condition where the RARP packets sent by qemu, and then the MAC learning packets, will be lost.
If you are using the iptables firewall and you have OpenStack Rocky or later, then enabling post-copy live migration should reduce the downtime. In this configuration we do not have the race between neutron and libvirt, so the RARP packets should not be lost.
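For reference, post-copy is switched on through the standard nova option on each compute node; a minimal sketch (restart nova-compute after changing it, and note it only takes effect when both source and destination support it):

```ini
[libvirt]
# allow nova/libvirt to switch a live migration to post-copy mode
live_migration_permit_post_copy = True
```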
Hello, I have this problem on Rocky and newer with the iptables_hybrid firewall. So, can I solve it by using post-copy live migration? Thanks, Ignazio
On Mon, 2020-04-27 at 18:19 +0200, Ignazio Cassano wrote:
So this behavior has always been how nova worked, but in Rocky the https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neu... spec introduced the ability to shorten the outage by pre-binding the port and activating it when the VM is resumed on the destination host, before we get to post live migration.
This reduces the outage time, although it can't be fully eliminated, as some level of packet loss is always expected when you live migrate.
So yes, enabling post-copy live migration should help, but be aware that if a network partition happens during a post-copy live migration, the VM will crash and need to be restarted. It is generally safe to use and will improve migration performance, but unlike pre-copy migration, if the guest resumes on the destination and a memory page has not been copied yet, it must wait for that page to be retrieved from the source host. If the connection to the source host is interrupted, the VM can't do that, the migration will fail and the instance will crash. With pre-copy migration, if there is a network partition during the migration, the migration will fail but the instance will continue to run on the source host.
So while I would still recommend using it, it is just good to be aware of that behavior change.
Hello, and thanks for your answer. I tried to enable post-copy live migration on the compute nodes and restarted the nova-compute service, but when I try to live migrate a VM I have the same issue: I lose a lot of packets. If I try with an instance that has a daemon pinging the default gateway, the live migration works fine and I do not lose packets. I think this is a big issue. I tried with Rocky and Queens with iptables_hybrid and both have the issue. :-( Ignazio
Sorry for my mistake. I tried with Rocky and Stein and both have the same issue. After a live migration, ping starts to work only after some minutes (3 or 4). I have never seen this issue before. Ignazio
Hello, I googled for my issue and I found the following:
https://bugs.launchpad.net/neutron/+bug/1866139
Regards Ignazio
I made some tests with Queens and Rocky: on Queens, the migrated VM makes an ARP request and, when you ping it, it receives an ARP reply from the physical router. On Rocky, it makes an ARP request, but when you ping it, it does not receive any ARP reply. It starts to respond only when it sends traffic itself, for example when polling an NTP server. Ignazio
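Until the underlying race is fixed, a stopgap that matches what the ping script achieves is to have the guest announce its own MAC right after migration, for example with iputils arping (a sketch; eth0 and 10.0.0.5 are placeholders for the guest NIC and its IP):

```shell
# inside the guest after migration: send gratuitous ARP announcements
# so the physical switches and router relearn our MAC on the new port
arping -A -I eth0 -c 3 10.0.0.5
```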
Hello, some updates about this issue. I read that someone has the same issue, as reported here:
https://bugs.launchpad.net/neutron/+bug/1866139
If you read the discussion, someone says the GARP must be sent by qemu during live migration. If this is true, it means qemu/libvirt are bugged on Rocky/Stein. So I tried to use Stein and Rocky with the same libvirt/qemu packages I installed on Queens (I updated the compute and controller nodes on Queens to obtain the same libvirt/qemu versions deployed on Rocky and Stein).
On Queens, live migration on a provider network continues to work fine. On Rocky and Stein it does not, so I think the issue is related to the OpenStack components.
Best Regards Ignazio Cassano
On Wed, 2020-04-29 at 10:39 +0200, Ignazio Cassano wrote:
That is not correct. qemu/libvirt has always used RARP, which predates GARP, as its MAC learning frames: https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol https://lists.gnu.org/archive/html/qemu-devel/2009-10/msg01457.html However, it looks like this was broken in 2016 in qemu 2.6.0 (https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04645.html) and was fixed by https://github.com/qemu/qemu/commit/ca1ee3d6b546e841a1b9db413eb8fa09f13a061b Can you confirm you are not using the broken 2.6.0 release, and are on 2.7 or newer, or 2.4 and older?
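For reference, the announce frames qemu sends are plain RARP "request reverse" packets carrying the guest MAC. A small sketch of that frame layout, built only from the RARP header format described on the wiki page above (the helper and the example MAC are illustrative, not qemu code):

```python
import struct

def rarp_announce_frame(mac: bytes) -> bytes:
    """Build an Ethernet frame shaped like a RARP self-announce packet."""
    assert len(mac) == 6
    # Ethernet header: broadcast destination, guest MAC source, EtherType 0x8035 (RARP)
    eth = b"\xff" * 6 + mac + struct.pack("!H", 0x8035)
    payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                  # htype: Ethernet
        0x0800,             # ptype: IPv4
        6, 4,               # hlen, plen
        3,                  # oper: "request reverse"
        mac, b"\x00" * 4,   # sender hardware/protocol address
        mac, b"\x00" * 4,   # target hardware/protocol address
    )
    return eth + payload

# fa:16:3e:... is the usual OpenStack-assigned MAC prefix, used here as an example
frame = rarp_announce_frame(b"\xfa\x16\x3e\x00\x00\x01")
print(len(frame))  # 42 bytes: 14-byte Ethernet header + 28-byte RARP body
```

These frames can be watched for on the destination host during a migration with something like `tcpdump -i <iface> rarp`; if none arrive, they were lost in the race described above.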
On Queens we have only a single port binding, and nova blindly assumes that the port binding details won't change when it does a live migration, so it does not update the XML for the network interfaces. The port binding is updated after the migration is complete, in post_live_migration. On Rocky+, neutron optionally uses the multiple-port-bindings flow to pre-bind the port to the destination so it can update the XML if needed, and if post-copy live migration is enabled it will asynchronously activate the destination port binding before post_live_migration, shortening the downtime.
If you are using the iptables firewall, os-vif will have pre-created the OVS port and the intermediate Linux bridge before the migration started, which allows neutron to wire it up (put it on the correct VLAN and install security groups) before the VM completes the migration.
If you are using the OVS firewall, os-vif still pre-creates the OVS port, but libvirt deletes it and recreates it. As a result there is a race when using the openvswitch firewall that can result in the RARP packets being lost.
Hello Sean, the following is the configuration on my compute nodes:

[root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep libvirt
libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7.x86_64
libvirt-daemon-kvm-4.5.0-33.el7.x86_64
libvirt-libs-4.5.0-33.el7.x86_64
libvirt-daemon-driver-network-4.5.0-33.el7.x86_64
libvirt-daemon-driver-nodedev-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-33.el7.x86_64
libvirt-client-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-core-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-logical-4.5.0-33.el7.x86_64
libvirt-daemon-driver-secret-4.5.0-33.el7.x86_64
libvirt-daemon-4.5.0-33.el7.x86_64
libvirt-daemon-driver-nwfilter-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-scsi-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-rbd-4.5.0-33.el7.x86_64
libvirt-daemon-config-nwfilter-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-disk-4.5.0-33.el7.x86_64
libvirt-bash-completion-4.5.0-33.el7.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-4.5.0-33.el7.x86_64
libvirt-python-4.5.0-1.el7.x86_64
libvirt-daemon-driver-interface-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-mpath-4.5.0-33.el7.x86_64

[root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep qemu
qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64
qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
centos-release-qemu-ev-1.0-4.el7.centos.noarch
ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch
qemu-img-ev-2.12.0-44.1.el7_8.1.x86_64
As far as firewall driver /etc/neutron/plugins/ml2/openvswitch_agent.ini:
firewall_driver = iptables_hybrid
I have the same libvirt/qemu versions on my Queens, Rocky and Stein testing environments, and the same firewall driver. Live migration on a provider network works fine on Queens. It does not work fine on Rocky and Stein (the VM loses connectivity after it is migrated and starts to respond only when it sends a network packet, for example when chrony polls the time server).
Ignazio
PS: I have testing environments on Queens, Rocky and Stein, and I can run any tests you need. Ignazio
Il giorno mer 29 apr 2020 alle ore 16:19 Ignazio Cassano < ignaziocassano@gmail.com> ha scritto:
Hello Sean, the following is the configuration on my compute nodes: [root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep libvirt libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7.x86_64 libvirt-daemon-kvm-4.5.0-33.el7.x86_64 libvirt-libs-4.5.0-33.el7.x86_64 libvirt-daemon-driver-network-4.5.0-33.el7.x86_64 libvirt-daemon-driver-nodedev-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-gluster-4.5.0-33.el7.x86_64 libvirt-client-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-core-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-logical-4.5.0-33.el7.x86_64 libvirt-daemon-driver-secret-4.5.0-33.el7.x86_64 libvirt-daemon-4.5.0-33.el7.x86_64 libvirt-daemon-driver-nwfilter-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-scsi-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-rbd-4.5.0-33.el7.x86_64 libvirt-daemon-config-nwfilter-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-disk-4.5.0-33.el7.x86_64 libvirt-bash-completion-4.5.0-33.el7.x86_64 libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-4.5.0-33.el7.x86_64 libvirt-python-4.5.0-1.el7.x86_64 libvirt-daemon-driver-interface-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-mpath-4.5.0-33.el7.x86_64 [root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep qemu qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64 qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64 libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64 centos-release-qemu-ev-1.0-4.el7.centos.noarch ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch qemu-img-ev-2.12.0-44.1.el7_8.1.x86_64
As far as firewall driver /etc/neutron/plugins/ml2/openvswitch_agent.ini:
firewall_driver = iptables_hybrid
I have the same libvirt/qemu version and the same firewall driver on my queens, rocky and stein testing environments. Live migration on a provider network works fine on queens. It does not work on rocky and stein (the vm loses connectivity after it is migrated and only starts to respond again when the vm itself sends a network packet, for example when chrony polls the time server).
Ignazio
On Wed, 29 Apr 2020 at 14:36, Sean Mooney <smooney@redhat.com> wrote:
On Wed, 2020-04-29 at 10:39 +0200, Ignazio Cassano wrote:
Hello, some updates about this issue. I read that someone has the same issue, as reported here:
https://bugs.launchpad.net/neutron/+bug/1866139
If you read the discussion, someone says the GARP must be sent by qemu during live migration. If this is true, it means qemu/libvirt are bugged on rocky/stein.
That is not correct. qemu/libvirt has always used RARP, which predates GARP, for its mac learning frames instead: https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol https://lists.gnu.org/archive/html/qemu-devel/2009-10/msg01457.html However, it looks like this was broken in 2016 in qemu 2.6.0 (https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04645.html) but was fixed by https://github.com/qemu/qemu/commit/ca1ee3d6b546e841a1b9db413eb8fa09f13a061b Can you confirm you are not using the broken 2.6.0 release, and are using 2.7 or newer or 2.4 and older?
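As a quick sketch of the version check being asked for (not part of the original mail — the exact affected range should be confirmed against the qemu-devel thread above, since a distro may have backported the fix):

```python
# Flag qemu versions in the window where the post-migration RARP
# announcement was reportedly broken: 2.7+ and 2.4-or-older are fine,
# so we flag anything in between.

def parse_version(v: str) -> tuple:
    """Turn a version string like '2.12.0' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def possibly_affected(qemu_version: str) -> bool:
    v = parse_version(qemu_version)
    return (2, 5, 0) <= v < (2, 7, 0)

print(possibly_affected("2.6.0"))   # True  -> the broken release
print(possibly_affected("2.12.0"))  # False -> the qemu-kvm-ev build in this thread
```

Since Ignazio is on qemu-kvm-ev 2.12.0, the broken-qemu theory is ruled out, which is consistent with the problem being in the openstack layer.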
So I tried stein and rocky with the same version of the libvirt/qemu packages I installed on queens (I updated the compute and controller nodes on queens to obtain the same libvirt/qemu version deployed on rocky and stein). On queens, live migration on a provider network continues to work fine. On rocky and stein it does not, so I think the issue is related to the openstack components.
On queens we have only a single port binding, and nova blindly assumes that the port binding details won't change when it does a live migration, so it does not update the xml for the network interfaces. The port binding is updated after the migration is complete, in post_livemigration. In rocky+, neutron optionally uses the multiple port bindings flow to pre-bind the port to the destination so nova can update the xml if needed, and if post-copy live migration is enabled it will asynchronously activate the dest port binding before post_livemigration, shortening the downtime.
If you are using the iptables firewall, os-vif will have precreated the ovs port and the intermediate linux bridge before the migration started, which allows neutron to wire it up (put it on the correct vlan and install security groups) before the vm completes the migration. If you are using the ovs firewall, os-vif still precreates the ovs port, but libvirt deletes it and recreates it. As a result there is a race when using the openvswitch firewall that can result in the RARP packets being lost.
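One way to debug this race is to watch for the RARP frames on the destination host (ethertype 0x8035, e.g. tcpdump's `rarp` filter) while the migration completes. As a self-contained illustration (not from the thread), here is a minimal check for the kind of frame qemu broadcasts when the guest resumes:

```python
import struct

ETH_P_RARP = 0x8035  # ethertype for Reverse ARP

def is_rarp_announce(frame: bytes) -> bool:
    """True if an ethernet frame is a RARP request, i.e. the style of
    self-announcement qemu broadcasts after a live migration resumes."""
    if len(frame) < 14 + 28:          # ethernet header + RARP payload
        return False
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    if ethertype != ETH_P_RARP:
        return False
    opcode = struct.unpack_from("!H", frame, 20)[0]
    return opcode == 3                # 3 = reverse request ("who am I")

# Build a synthetic announcement for the mac 52:54:00:12:34:56.
mac = bytes.fromhex("525400123456")
frame = (
    b"\xff" * 6                       # broadcast destination
    + mac                             # source mac
    + struct.pack("!H", ETH_P_RARP)   # ethertype
    + struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)  # htype, ptype, hlen, plen, op
    + mac + b"\x00" * 4               # sender hw / protocol address
    + mac + b"\x00" * 4               # target hw / protocol address
)
print(is_rarp_announce(frame))  # True
```

If these frames show up on the tap device but never on the provider network, that points at the port not being wired up yet when qemu announces, which is exactly the race described above.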
Best Regards Ignazio Cassano
On Mon, 27 Apr 2020 at 19:50, Sean Mooney <smooney@redhat.com> wrote:
On Mon, 2020-04-27 at 18:19 +0200, Ignazio Cassano wrote:
Hello, I have this problem with rocky or newer with the iptables_hybrid firewall. So, can I solve it using post-copy live migration?
So this behavior has always been how nova worked, but in rocky the https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neu... spec introduced the ability to shorten the outage by pre-binding the port and activating it when the vm is resumed on the destination host, before we get to post live migrate. This reduces the outage time, although it can't be fully eliminated, as some level of packet loss is always expected when you live migrate.

So yes, enabling post-copy live migration should help, but be aware that if a network partition happens during a post-copy live migration, the vm will crash and need to be restarted. It is generally safe to use and will improve the migration performance, but unlike pre-copy migration, if the guest resumes on the dest and a memory page has not been copied yet, it must wait for that page to be copied and retrieve it from the source host. If the connection to the source host is interrupted, the vm can't do that, the migration will fail, and the instance will crash. With pre-copy migration, if there is a network partition during the migration, the migration fails but the instance continues to run on the source host.

So while I would still recommend using it, it is just good to be aware of that behavior change.
Thanks Ignazio
So, being pragmatic, I think the simplest path forward — given my other patches have not landed in almost 2 years — is to quickly add a workaround config option to disable multiple port binding, which we can backport, and then try to work on the actual fix after. According to https://bugs.launchpad.net/neutron/+bug/1815989 that should serve as a workaround for those that have this issue, though it is a regression in functionality.

I can create a patch that will do that in an hour or so and submit a follow-up DNM patch to enable the workaround in one of the gate jobs that tests live migration. I have a meeting in 10 mins and need to finish the patch I'm currently updating, but I'll submit a PoC once that is done.

I'm not sure if I will be able to spend time on the actual fix which I proposed last year, but I'll see what I can do.
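Once such a patch lands, the workaround would presumably be toggled in nova.conf along these lines — the option name below is purely illustrative, so take the real flag name from the review itself:

```ini
[workarounds]
# Hypothetical option name -- see https://review.opendev.org/#/c/724386/
# for the actual flag. Disables the rocky+ multiple port bindings flow
# during live migration, restoring the queens-era single-binding behavior.
disable_multiple_port_bindings = True
```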
Many thanks. Please keep in touch. Ignazio
On Wed, 2020-04-29 at 17:10 +0200, Ignazio Cassano wrote:
Many thanks. Please keep in touch.
Here are the two patches. The first, https://review.opendev.org/#/c/724386/, is the actual change adding the new config option. It needs a release note and some tests, but it should be functional, hence the [WIP]. I have not enabled the workaround in any job in this patch, so the ci run will assert that it does not break anything in the default case.

The second patch is https://review.opendev.org/#/c/724387/, which enables the workaround in the multi-node ci jobs and tests that live migration still works when the workaround is enabled.

This should work, as it is what we expect to happen if you are using a modern nova with an old neutron. It is marked [DNM] as I don't intend that patch to merge, but if the workaround is useful we might consider enabling it for one of the jobs to get ci coverage — but not all of the jobs.

I have not had time to deploy a 2-node env today, but I'll try to test this locally tomorrow.
Ignazio
Il giorno mer 29 apr 2020 alle ore 16:55 Sean Mooney smooney@redhat.com ha scritto:
so bing pragmatic i think the simplest path forward given my other patches have not laned in almost 2 years is to quickly add a workaround config option to disable mulitple port bindign which we can backport and then we can try and work on the actual fix after. acording to https://bugs.launchpad.net/neutron/+bug/1815989 that shoudl serve as a workaround for thos that hav this issue but its a regression in functionality.
i can create a patch that will do that in an hour or so and submit a followup DNM patch to enabel the workaound in one of the gate jobs that tests live migration. i have a meeting in 10 mins and need to finish the pacht im currently updating but ill submit a poc once that is done.
im not sure if i will be able to spend time on the actul fix which i proposed last year but ill see what i can do.
On Wed, 2020-04-29 at 16:37 +0200, Ignazio Cassano wrote:
PS I have testing environment on queens,rocky and stein and I can make test as you need. Ignazio
Il giorno mer 29 apr 2020 alle ore 16:19 Ignazio Cassano < ignaziocassano@gmail.com> ha scritto:
Hello Sean, the following is the configuration on my compute nodes: [root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep libvirt libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7.x86_64 libvirt-daemon-kvm-4.5.0-33.el7.x86_64 libvirt-libs-4.5.0-33.el7.x86_64 libvirt-daemon-driver-network-4.5.0-33.el7.x86_64 libvirt-daemon-driver-nodedev-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-gluster-4.5.0-33.el7.x86_64 libvirt-client-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-core-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-logical-4.5.0-33.el7.x86_64 libvirt-daemon-driver-secret-4.5.0-33.el7.x86_64 libvirt-daemon-4.5.0-33.el7.x86_64 libvirt-daemon-driver-nwfilter-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-scsi-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-rbd-4.5.0-33.el7.x86_64 libvirt-daemon-config-nwfilter-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-disk-4.5.0-33.el7.x86_64 libvirt-bash-completion-4.5.0-33.el7.x86_64 libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-4.5.0-33.el7.x86_64 libvirt-python-4.5.0-1.el7.x86_64 libvirt-daemon-driver-interface-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-mpath-4.5.0-33.el7.x86_64 [root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep qemu qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64 qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64 libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64 centos-release-qemu-ev-1.0-4.el7.centos.noarch ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch qemu-img-ev-2.12.0-44.1.el7_8.1.x86_64
As far as firewall driver
/etc/neutron/plugins/ml2/openvswitch_agent.ini:
firewall_driver = iptables_hybrid
I have same libvirt/qemu version on queens, on rocky and on stein
testing
environment and the same firewall driver. Live migration on provider network on queens works fine. It does not work fine on rocky and stein (vm lost connection after it
is
migrated and start to respond only when the vm send a network packet ,
for
example when chrony pools the time server).
Ignazio
Il giorno mer 29 apr 2020 alle ore 14:36 Sean Mooney <
smooney@redhat.com>
ha scritto:
On Wed, 2020-04-29 at 10:39 +0200, Ignazio Cassano wrote:
Hello, some updated about this issue. I read someone has got same issue as reported here:
https://bugs.launchpad.net/neutron/+bug/1866139
If you read the discussion, someone tells that the garp must be
sent by
qemu during live miration. If this is true, this means on rocky/stein the qemu/libvirt are
bugged.
it is not correct. qemu/libvir thas alsway used RARP which predates GARP to serve as
its mac
learning frames instead
https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol
https://lists.gnu.org/archive/html/qemu-devel/2009-10/msg01457.html however it looks like this was broken in 2016 in qemu 2.6.0 https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04645.html but was fixed by
https://github.com/qemu/qemu/commit/ca1ee3d6b546e841a1b9db413eb8fa09f13a061b
can you confirm you are not using the broken 2.6.0 release and are
using
2.7 or newer or 2.4 and older.
So I tried to use stein and rocky with the same version of
libvirt/qemu
packages I installed on queens (I updated compute and controllers
node
on
queens for obtaining same libvirt/qemu version deployed on rocky
and
stein).
On queens live migration on provider network continues to work
fine.
On rocky and stein not, so I think the issue is related to
openstack
components .
on queens we have only a singel prot binding and nova blindly assumes that the port binding details wont change when it does a live migration and does not update the xml for
the
netwrok interfaces.
the port binding is updated after the migration is complete in post_livemigration in rocky+ neutron optionally uses the multiple port bindings flow to prebind the port to the destiatnion so it can update the xml if needed and if post copy live migration is enable it will asyconsly activate teh dest port binding before post_livemigration shortenting the downtime.
if you are using the iptables firewall os-vif will have precreated
the
ovs port and intermediate linux bridge before the migration started which will allow neutron to wire it up (put it on
the
correct vlan and install security groups) before the vm completes the migraton.
if you are using the ovs firewall os-vif still precreates teh ovs
port
but libvirt deletes it and recreats it too. as a result there is a race when using openvswitch firewall that can result in the RARP packets being lost.
Best Regards Ignazio Cassano
Il giorno lun 27 apr 2020 alle ore 19:50 Sean Mooney <
smooney@redhat.com>
ha scritto:
> On Mon, 2020-04-27 at 18:19 +0200, Ignazio Cassano wrote:
> > Hello, I have this problem with rocky or newer with iptables_hybrid firewall.
> > So, can I solve using post copy live migration ???
>
> so this behavior has always been how nova worked, but the rocky
> https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neu...
> spec introduced the ability to shorten the outage by pre binding the port and activating it when
> the vm is resumed on the destination host, before we get to post live migrate.
>
> this reduces the outage time, although it cant be fully eliminated as some level of packet loss is
> always expected when you live migrate.
>
> so yes, enabling post copy live migration should help, but be aware that if a network partition happens
> during a post copy live migration the vm will crash and need to be restarted.
> it is generally safe to use and will improve the migration performance, but unlike pre copy migration,
> if the guest resumes on the dest and a memory page has not been copied yet, then it must wait for it to
> be copied and retrieve it from the source host. if the connection to the source host is interrupted
> then the vm cant do that, the migration will fail, and the instance will crash. if you are using
> precopy migration and there is a network partition during the migration, the migration will fail but
> the instance will continue to run on the source host.
>
> so while i would still recommend using it, it is just good to be aware of that behavior change.
>
> > Thanks
> > Ignazio
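For operators following this advice: post copy must also be permitted on the nova side before libvirt will use it. A sketch of the relevant nova.conf setting on the compute nodes follows; the option name comes from nova's libvirt driver, so double-check it against the configuration reference for your release:

```ini
[libvirt]
# Allow a live migration to switch to post-copy mode; requires a kernel
# and qemu with userfaultfd support on both hosts.
live_migration_permit_post_copy = True
```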
Hello Sean, many thanks for your precious work. When you have finished I can test it following your instructions.
Best Regards Ignazio
On Wed 29 Apr 2020 at 19:49, Sean Mooney <smooney@redhat.com> wrote:
On Wed, 2020-04-29 at 17:10 +0200, Ignazio Cassano wrote:
Many thanks. Please keep in touch.
here are the two patches. the first, https://review.opendev.org/#/c/724386/, is the actual change to add the new config option. this needs a release note and some tests but it should be functional, hence the [WIP]. i have not enabled the workaround in any job in this patch, so the ci run will assert this does not break anything in the default case.
the second patch is https://review.opendev.org/#/c/724387/, which enables the workaround in the multi node ci jobs and tests that live migration still works when the workaround is enabled.
this should work, as it is what we expect to happen if you are using a modern nova with an old neutron. it is marked [DNM] as i dont intend that patch to merge, but if the workaround is useful we might consider enabling it for one of the jobs to get ci coverage, though not all of the jobs.
i have not had time to deploy a 2 node env today but ill try and test this locally tomorrow.
Ignazio
On Wed 29 Apr 2020 at 16:55, Sean Mooney <smooney@redhat.com> wrote:
so being pragmatic, i think the simplest path forward, given my other patches have not landed in almost 2 years, is to quickly add a workaround config option to disable multiple port binding, which we can backport, and then we can try and work on the actual fix after.
according to https://bugs.launchpad.net/neutron/+bug/1815989 that should serve as a workaround for those that have this issue, but it is a regression in functionality.
i can create a patch that will do that in an hour or so and submit a followup DNM patch to enable the workaround in one of the gate jobs that tests live migration. i have a meeting in 10 mins and need to finish the patch im currently updating, but ill submit a poc once that is done.
im not sure if i will be able to spend time on the actual fix which i proposed last year, but ill see what i can do.
On Wed, 2020-04-29 at 16:37 +0200, Ignazio Cassano wrote:
PS I have testing environments on queens, rocky and stein and I can run tests as you need. Ignazio
On Wed 29 Apr 2020 at 16:19, Ignazio Cassano <ignaziocassano@gmail.com> wrote:
Hello Sean, the following is the configuration on my compute nodes:

[root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep libvirt
libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7.x86_64
libvirt-daemon-kvm-4.5.0-33.el7.x86_64
libvirt-libs-4.5.0-33.el7.x86_64
libvirt-daemon-driver-network-4.5.0-33.el7.x86_64
libvirt-daemon-driver-nodedev-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-33.el7.x86_64
libvirt-client-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-core-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-logical-4.5.0-33.el7.x86_64
libvirt-daemon-driver-secret-4.5.0-33.el7.x86_64
libvirt-daemon-4.5.0-33.el7.x86_64
libvirt-daemon-driver-nwfilter-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-scsi-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-rbd-4.5.0-33.el7.x86_64
libvirt-daemon-config-nwfilter-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-disk-4.5.0-33.el7.x86_64
libvirt-bash-completion-4.5.0-33.el7.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-4.5.0-33.el7.x86_64
libvirt-python-4.5.0-1.el7.x86_64
libvirt-daemon-driver-interface-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-mpath-4.5.0-33.el7.x86_64

[root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep qemu
qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64
qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
centos-release-qemu-ev-1.0-4.el7.centos.noarch
ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch
qemu-img-ev-2.12.0-44.1.el7_8.1.x86_64

As far as the firewall driver, in /etc/neutron/plugins/ml2/openvswitch_agent.ini:
firewall_driver = iptables_hybrid

I have the same libvirt/qemu version on the queens, rocky and stein testing environments, and the same firewall driver.
Live migration on a provider network works fine on queens. It does not work fine on rocky and stein (the vm loses connectivity after it is migrated and starts to respond only when the vm sends a network packet, for example when chrony polls the time server).
Ignazio
On Wed 29 Apr 2020 at 14:36, Sean Mooney <smooney@redhat.com> wrote:
On Wed, 2020-04-29 at 10:39 +0200, Ignazio Cassano wrote:
> Hello, some updates about this issue.
> I read someone has got the same issue as reported here:
>
> https://bugs.launchpad.net/neutron/+bug/1866139
>
> If you read the discussion, someone tells that the garp must be sent by
> qemu during live migration.
> If this is true, this means on rocky/stein the qemu/libvirt are bugged.

it is not correct.
qemu/libvirt has always used RARP, which predates GARP, to serve as its mac learning frames instead
https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol
https://lists.gnu.org/archive/html/qemu-devel/2009-10/msg01457.html
however it looks like this was broken in 2016 in qemu 2.6.0
https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04645.html
but was fixed by
https://github.com/qemu/qemu/commit/ca1ee3d6b546e841a1b9db413eb8fa09f13a061b
can you confirm you are not using the broken 2.6.0 release and are using 2.7 or newer, or 2.4 and older.

> So I tried to use stein and rocky with the same version of libvirt/qemu
> packages I installed on queens (I updated the compute and controller nodes on
> queens to obtain the same libvirt/qemu version deployed on rocky and stein).
>
> On queens live migration on provider network continues to work fine.
> On rocky and stein not, so I think the issue is related to openstack
> components.

on queens we have only a single port binding, and nova blindly assumes that the port binding details wont
change when it does a live migration and does not update the xml for the network interfaces.

the port binding is updated after the migration is complete, in post_livemigration.
in rocky+ neutron optionally uses the multiple port bindings flow to prebind the port to the destination
so it can update the xml if needed, and if post copy live migration is enabled it will asynchronously
activate the dest port binding before post_livemigration, shortening the downtime.

if you are using the iptables firewall, os-vif will have precreated the ovs port and intermediate linux
bridge before the migration started, which will allow neutron to wire it up (put it on the correct vlan
and install security groups) before the vm completes the migration.

if you are using the ovs firewall, os-vif still precreates the ovs port but libvirt deletes it and
recreates it too. as a result there is a race when using the openvswitch firewall that can result in
the RARP packets being lost.
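To make the mechanism concrete, the following is a sketch (an editor's illustration, not qemu's actual code) of the RARP frame qemu broadcasts after migration so that switches relearn the guest MAC; the field layout follows RFC 903:

```python
import struct

def rarp_announce_frame(guest_mac: bytes) -> bytes:
    """Build a RARP 'request reverse' frame like the self-announce
    packets qemu broadcasts after live migration (illustrative only)."""
    assert len(guest_mac) == 6
    # Ethernet header: broadcast destination, guest MAC as source,
    # EtherType 0x8035 (RARP).
    eth = b"\xff" * 6 + guest_mac + struct.pack("!H", 0x8035)
    # RARP body: htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4,
    # opcode=3 (request reverse). Sender/target hardware address are the
    # guest MAC; protocol addresses are zeroed, since only the source MAC
    # matters for switch learning.
    body = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)
    body += guest_mac + b"\x00" * 4 + guest_mac + b"\x00" * 4
    return eth + body

frame = rarp_announce_frame(bytes.fromhex("525400123456"))
print(len(frame))  # 42-byte frame before padding
```

If these frames are dropped in the window where libvirt is recreating the OVS port, nothing upstream learns the new location of the MAC until the guest itself transmits, which matches the symptom described earlier in the thread.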
Hello Sean, to be honest I did not understand the difference between the first and the second patch, but that is due to my poor skill and poor english.
In any case I would like to test it. I saw I can download the files workaround.py and neutron.py, and there is a new option force_legacy_port_binding.
How can I test? Must I enable the new option under the workaround section in nova.conf on the compute nodes, setting it to true?
Must the downloaded files (from the first or the second patch?) be copied on the compute nodes under /usr/lib/python2.7/site-packages as nova/conf/workaround.py and nova/network/neutron.py, and then the nova compute service restarted?
Does it work only for new instances or also for running instances?
Sorry for disturbing. Best Regards Ignazio
On Fri, 2020-05-01 at 18:34 +0200, Ignazio Cassano wrote:
Hello Sean, to be honest I did not understand what is the difference between the first and second patch but it is due to my poor skill and my poor english.
no worries. the first patch is the actual change to add the new config option. the second patch is just a change to force our ci jobs to enable the config option. we probably dont want to do that permanently, which is why i have marked it [DNM] or "do not merge"; it is just there to prove the first patch is correct.
Anycase I would like to test it. I saw I can download files : workaround.py and neutron.py and there is a new option force_legacy_port_binding. How can I test? I must enable the new option under workaround section in the in nova.conf on compute nodes setting it to true?
yes, that is correct. if you apply the first patch you need to set the new config option in the workarounds section in the nova.conf on the controller. specifically, the conductor needs to have this set. i dont think this is needed on the compute nodes; at least, it should not need to be set in the compute node nova.conf for the live migration issue.
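Using the option name mentioned elsewhere in the thread (force_legacy_port_binding), the controller-side change would look something like the sketch below; check the final merged patch for the exact option name:

```ini
# nova.conf on the controller node(s) running nova-conductor
[workarounds]
# Opt back in to the pre-Rocky single port binding flow
force_legacy_port_binding = True
```

followed by a restart of the nova-conductor service.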
The files downloaded (from first or secondo patch?) must be copied on compute nodes under /usr/lib/python2.7/site_packages nova/conf/workaround.py and nova/network/neutron.py and then restart nova compute service?
once we have merged this in master ill backport it to the different openstack versions back to rocky. if you want to test it before then, the simplest thing to do is just manually make the same change, unless you are using devstack, in which case you could cherry-pick the change to whatever branch you are testing.
It should work only for new instances or also for running instances?
it will apply to all instances. what the change is doing is disabling our detection of neutron support for the multiple port binding workflow. we still have compatibility code for supporting old versions of neutron; we probably should remove that at some point, but when the config option is set we will ignore whether you are using an old or new neutron and just fall back to how we did things before rocky.
in principle that should make live migration have more packet loss, but since people have reported it actually fixes the issue in this case, i have written the patch so you can opt in to the old behaviour.
if that works for you in your testing, we can continue to keep the workaround and the old compatibility code until we resolve the issue with the multiple port binding flow.
Sorry for disturbing.
dont be sorry, it is fine to ask questions. just be aware it is a long weekend, so i will not be working monday but i should be back on tuesday. ill update the patch then with a release note and a unit test, and hopefully i can get some cores to review it.
Best Regards Ignazio
Il Mer 29 Apr 2020, 19:49 Sean Mooney smooney@redhat.com ha scritto:
On Wed, 2020-04-29 at 17:10 +0200, Ignazio Cassano wrote:
Many thanks. Please keep in touch.
here are the two patches. the first https://review.opendev.org/#/c/724386/ is the actual change to add the new config opition this needs a release note and some tests but it shoudl be functional hence the [WIP] i have not enable the workaround in any job in this patch so the ci run will assert this does not break anything in the default case
the second patch is https://review.opendev.org/#/c/724387/ which enables the workaround in the multi node ci jobs and is testing that live migration exctra works when the workaround is enabled.
this should work as it is what we expect to happen if you are using a moderne nova with an old neutron. its is marked [DNM] as i dont intend that patch to merge but if the workaround is useful we migth consider enableing it for one of the jobs to get ci coverage but not all of the jobs.
i have not had time to deploy a 2 node env today but ill try and test this locally tomorow.
Ignazio
Il giorno mer 29 apr 2020 alle ore 16:55 Sean Mooney <smooney@redhat.com
ha scritto:
so bing pragmatic i think the simplest path forward given my other
patches
have not laned in almost 2 years is to quickly add a workaround config option to
disable
mulitple port bindign which we can backport and then we can try and work on the actual fix
after.
acording to https://bugs.launchpad.net/neutron/+bug/1815989 that
shoudl
serve as a workaround for thos that hav this issue but its a regression in functionality.
i can create a patch that will do that in an hour or so and submit a followup DNM patch to enabel the workaound in one of the gate jobs that tests live migration. i have a meeting in 10 mins and need to finish the pacht im currently updating but ill submit a poc once that is done.
im not sure if i will be able to spend time on the actul fix which i proposed last year but ill see what i can do.
On Wed, 2020-04-29 at 16:37 +0200, Ignazio Cassano wrote:
PS I have testing environment on queens,rocky and stein and I can make
test
as you need. Ignazio
Il giorno mer 29 apr 2020 alle ore 16:19 Ignazio Cassano < ignaziocassano@gmail.com> ha scritto:
Hello Sean, the following is the configuration on my compute nodes: [root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep libvirt libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7.x86_64 libvirt-daemon-kvm-4.5.0-33.el7.x86_64 libvirt-libs-4.5.0-33.el7.x86_64 libvirt-daemon-driver-network-4.5.0-33.el7.x86_64 libvirt-daemon-driver-nodedev-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-gluster-4.5.0-33.el7.x86_64 libvirt-client-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-core-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-logical-4.5.0-33.el7.x86_64 libvirt-daemon-driver-secret-4.5.0-33.el7.x86_64 libvirt-daemon-4.5.0-33.el7.x86_64 libvirt-daemon-driver-nwfilter-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-scsi-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-rbd-4.5.0-33.el7.x86_64 libvirt-daemon-config-nwfilter-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-disk-4.5.0-33.el7.x86_64 libvirt-bash-completion-4.5.0-33.el7.x86_64 libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-4.5.0-33.el7.x86_64 libvirt-python-4.5.0-1.el7.x86_64 libvirt-daemon-driver-interface-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-mpath-4.5.0-33.el7.x86_64 [root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep qemu qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64 qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64 libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64 centos-release-qemu-ev-1.0-4.el7.centos.noarch ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch qemu-img-ev-2.12.0-44.1.el7_8.1.x86_64
As far as firewall driver
/etc/neutron/plugins/ml2/openvswitch_agent.ini:
firewall_driver = iptables_hybrid
I have same libvirt/qemu version on queens, on rocky and on stein
testing
environment and the same firewall driver. Live migration on provider network on queens works fine. It does not work fine on rocky and stein (vm lost connection after
it
is
migrated and start to respond only when the vm send a network
packet ,
for
example when chrony pools the time server).
Ignazio
Il giorno mer 29 apr 2020 alle ore 14:36 Sean Mooney <
smooney@redhat.com>
ha scritto:
> On Wed, 2020-04-29 at 10:39 +0200, Ignazio Cassano wrote:
> > Hello, some updates about this issue.
> > I read someone has got the same issue as reported here:
> >
> > https://bugs.launchpad.net/neutron/+bug/1866139
> >
> > If you read the discussion, someone says that the GARP must be sent
> > by qemu during live migration.
> > If this is true, this means on rocky/stein the qemu/libvirt are
> > bugged.
>
> that is not correct.
> qemu/libvirt has always used RARP, which predates GARP, as its mac
> learning frames; see
> https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol
> https://lists.gnu.org/archive/html/qemu-devel/2009-10/msg01457.html
> however it looks like this was broken in 2016 in qemu 2.6.0
> https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04645.html
> but was fixed by
> https://github.com/qemu/qemu/commit/ca1ee3d6b546e841a1b9db413eb8fa09f13a061b
> can you confirm you are not using the broken 2.6.0 release and are
> using 2.7 or newer, or 2.4 and older?
>
> > So I tried to use stein and rocky with the same version of the
> > libvirt/qemu packages I installed on queens (I updated compute and
> > controller nodes on queens to obtain the same libvirt/qemu version
> > deployed on rocky and stein).
> >
> > On queens, live migration on a provider network continues to work
> > fine.
> > On rocky and stein it does not, so I think the issue is related to
> > openstack components.
>
> on queens we have only a single port binding and nova blindly assumes
> that the port binding details won't change when it does a live
> migration, so it does not update the xml for the network interfaces.
>
> the port binding is updated after the migration is complete in
> post_livemigration.
> in rocky+, neutron optionally uses the multiple port bindings flow to
> prebind the port to the destination, so nova can update the xml if
> needed, and if post copy live migration is enabled it will
> asynchronously activate the dest port binding before
> post_livemigration, shortening the downtime.
>
> if you are using the iptables firewall, os-vif will have precreated
> the ovs port and the intermediate linux bridge before the migration
> started, which allows neutron to wire it up (put it on the correct
> vlan and install security groups) before the vm completes the
> migration.
>
> if you are using the ovs firewall, os-vif still precreates the ovs
> port, but libvirt deletes it and recreates it too.
> as a result there is a race when using the openvswitch firewall that
> can result in the RARP packets being lost.
>
> [snip: the remainder of the quote repeats the 27 Apr exchange, which
> appears in full earlier in the thread]
Thanks, have a nice long weekend Ignazio
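Sean's version question above can be turned into a quick check. This is an unofficial sketch: the affected range is inferred from his "2.7 or newer, or 2.4 and older" remark, so treat the exact boundaries as an assumption rather than a confirmed fact.

```python
# Quick check, inferred from the reply above: qemu's RARP announcements
# after live migration were reported broken in 2.6.0 and fixed before 2.7,
# so versions that are neither 2.7+ nor 2.4-or-older are suspect.
def qemu_rarp_suspect(version):
    """Return True if `version` falls in the suspect 2.5.x-2.6.x range."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (2, 5) <= (major, minor) < (2, 7)

print(qemu_rarp_suspect("2.6.0"))   # the broken release
print(qemu_rarp_suspect("2.12.0"))  # the qemu-kvm-ev build reported later in the thread
```

Since the compute nodes in this thread run qemu-kvm-ev 2.12.0, the broken-RARP qemu release is ruled out as the cause here.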
On Fri, 1 May 2020, 18:47 Sean Mooney <smooney@redhat.com> wrote:
On Fri, 2020-05-01 at 18:34 +0200, Ignazio Cassano wrote:
Hello Sean, to be honest I did not understand what the difference between the first and the second patch is, but that is due to my poor skill and my poor english.
no worries. the first patch is the actual change to add the new config option. the second patch is just a change to force our ci jobs to enable the config option. we probably don't want to do that permanently, which is why I have marked it [DNM] or "do not merge"; it is just there to prove the first patch is correct.
Anyway, I would like to test it. I saw I can download the files workarounds.py and neutron.py, and there is a new option, force_legacy_port_binding. How can I test? Must I enable the new option under the workarounds section of nova.conf on the compute nodes, setting it to true?
yes, that is correct. if you apply the first patch you need to set the new config option in the workarounds section of the nova.conf on the controller; specifically the conductor needs to have this set. I don't think this is needed on the compute nodes; at least it should not need to be set in the compute node nova.conf for the live migration issue.
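As a concrete sketch of that advice (the option name and the [workarounds] group come from the patch under review; exactly which services need it is Sean's best guess above, so verify against the merged change), the nova.conf on the controller/conductor would gain:

```ini
[workarounds]
# Opt back in to the pre-Rocky, single-port-binding live migration flow.
force_legacy_port_binding = True
```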
The downloaded files (from the first or the second patch?) must be copied on the compute nodes under /usr/lib/python2.7/site-packages as nova/conf/workarounds.py and nova/network/neutron.py, and then the nova compute service restarted?
once we have merged this in master I'll backport it to the different openstack versions back to rocky. if you want to test it before then, the simplest thing to do is just manually make the same change, unless you are using devstack, in which case you could cherry-pick the change to whatever branch you are testing.
Does it work only for new instances, or also for running instances?
it will apply to all instances. what the change does is disable our detection of neutron support for the multiple port binding workflow. we still have compatibility code for supporting old versions of neutron. we probably should remove that at some point, but when the config option is set we will ignore whether you are using an old or new neutron and just fall back to how we did things before rocky.
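To make the mechanics concrete, here is a toy sketch, not nova's actual code: a workaround flag that, when set, makes the capability check report "unsupported" so callers take the legacy (pre-Rocky, single port binding) path. The function name and the "binding-extended" extension alias are assumptions for illustration.

```python
# Toy sketch (hypothetical names, not copied from nova): how an operator
# flag can short-circuit feature detection so nova treats even a new
# neutron as if it lacked the multiple-port-bindings API.
def supports_port_binding_extended(force_legacy_port_binding, neutron_extensions):
    if force_legacy_port_binding:
        # Operator opted in to the legacy flow: ignore what neutron offers.
        return False
    return "binding-extended" in neutron_extensions

# With the workaround on, the new flow is skipped even when available:
print(supports_port_binding_extended(True, {"binding-extended"}))   # False
print(supports_port_binding_extended(False, {"binding-extended"}))  # True
```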
in principle that should make live migration have more packet loss, but since people have reported it actually fixes the issue in this case, I have written the patch so you can opt in to the old behaviour.
if that works for you in your testing, we can continue to keep the workaround and the old compatibility code until we resolve the issue with the multiple port binding flow.
Sorry for disturbing.
don't be sorry, it is fine to ask questions. just be aware it's a long weekend, so I will not be working monday, but I should be back on tuesday. I'll update the patch then with a release note and a unit test, and hopefully I can get some cores to review it.
Best Regards Ignazio
On Wed, 29 Apr 2020, 19:49 Sean Mooney <smooney@redhat.com> wrote:
On Wed, 2020-04-29 at 17:10 +0200, Ignazio Cassano wrote:
Many thanks. Please keep in touch.
here are the two patches.
the first, https://review.opendev.org/#/c/724386/, is the actual change to add the new config option. this needs a release note and some tests, but it should be functional, hence the [WIP]. I have not enabled the workaround in any job in this patch, so the ci run will assert that it does not break anything in the default case.

the second patch is https://review.opendev.org/#/c/724387/, which enables the workaround in the multi-node ci jobs and tests that live migration still works when the workaround is enabled.

this should work, as it is what we expect to happen if you are using a modern nova with an old neutron. it is marked [DNM] as I don't intend that patch to merge, but if the workaround is useful we might consider enabling it for one of the jobs to get ci coverage, though not for all of the jobs.

I have not had time to deploy a 2 node env today, but I'll try and test this locally tomorrow.
Ignazio
On Wed, 29 Apr 2020 at 16:55, Sean Mooney <smooney@redhat.com> wrote:
so being pragmatic, I think the simplest path forward, given my other patches have not landed in almost 2 years, is to quickly add a workaround config option to disable multiple port binding, which we can backport, and then we can try and work on the actual fix after.

according to https://bugs.launchpad.net/neutron/+bug/1815989 that should serve as a workaround for those that have this issue, but it is a regression in functionality.

I can create a patch that will do that in an hour or so and submit a followup DNM patch to enable the workaround in one of the gate jobs that tests live migration. I have a meeting in 10 mins and need to finish the patch I'm currently updating, but I'll submit a poc once that is done.

I'm not sure if I will be able to spend time on the actual fix which I proposed last year, but I'll see what I can do.
On Wed, 2020-04-29 at 16:37 +0200, Ignazio Cassano wrote:
PS: I have testing environments on queens, rocky and stein, and I can run tests as you need.
Ignazio
On Wed, 29 Apr 2020 at 16:19, Ignazio Cassano <ignaziocassano@gmail.com> wrote:
> Hello Sean,
> the following is the configuration on my compute nodes:
> [root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep libvirt
> libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7.x86_64
> libvirt-daemon-kvm-4.5.0-33.el7.x86_64
> libvirt-libs-4.5.0-33.el7.x86_64
> libvirt-daemon-driver-network-4.5.0-33.el7.x86_64
> libvirt-daemon-driver-nodedev-4.5.0-33.el7.x86_64
> libvirt-daemon-driver-storage-gluster-4.5.0-33.el7.x86_64
> libvirt-client-4.5.0-33.el7.x86_64
> libvirt-daemon-driver-storage-core-4.5.0-33.el7.x86_64
> libvirt-daemon-driver-storage-logical-4.5.0-33.el7.x86_64
> libvirt-daemon-driver-secret-4.5.0-33.el7.x86_64
> libvirt-daemon-4.5.0-33.el7.x86_64
> libvirt-daemon-driver-nwfilter-4.5.0-33.el7.x86_64
> libvirt-daemon-driver-storage-scsi-4.5.0-33.el7.x86_64
> libvirt-daemon-driver-storage-rbd-4.5.0-33.el7.x86_64
> libvirt-daemon-config-nwfilter-4.5.0-33.el7.x86_64
> libvirt-daemon-driver-storage-disk-4.5.0-33.el7.x86_64
> libvirt-bash-completion-4.5.0-33.el7.x86_64
> libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
> libvirt-daemon-driver-storage-4.5.0-33.el7.x86_64
> libvirt-python-4.5.0-1.el7.x86_64
> libvirt-daemon-driver-interface-4.5.0-33.el7.x86_64
> libvirt-daemon-driver-storage-mpath-4.5.0-33.el7.x86_64
> [root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep qemu
> qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64
> qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64
> libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
> centos-release-qemu-ev-1.0-4.el7.centos.noarch
> ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch
> qemu-img-ev-2.12.0-44.1.el7_8.1.x86_64
>
> As far as the firewall driver,
> /etc/neutron/plugins/ml2/openvswitch_agent.ini:
> firewall_driver = iptables_hybrid
>
> I have the same libvirt/qemu version on queens, on rocky and on stein
> testing environments, and the same firewall driver.
> Live migration on a provider network works fine on queens.
> It does not work fine on rocky and stein (the vm loses connectivity
> after it is migrated and starts to respond only when the vm sends a
> network packet, for example when chrony polls the time server).
>
> Ignazio
>
> On Wed, 29 Apr 2020 at 14:36, Sean Mooney <smooney@redhat.com> wrote:
> > [snip: Sean's reply of 29 Apr 14:36 and the earlier thread, quoted
> > in full above]
Hello Sean, I hope you'll read this email next week (this is only a report on my testing; I know you are not working now, but I am writing to keep track of the tests). I tested the patch on stein.

I downloaded the new workarounds.py file and copied it onto the controllers under the /usr/lib/python2.7/site-packages/nova/conf directory. I downloaded neutron.py and copied it onto the controllers under the /usr/lib/python2.7/site-packages/nova/network directory.

On each controller I added the following to the workarounds section of nova.conf:
force_legacy_port_binding = True

I restarted all nova services.

But when I try a live migration I get the following error in /var/log/nova/nova-api.log:
35f85547c4add392a221af1aab - default default] 10.102.184.193 "GET /v2.1/os-services?binary=nova-compute HTTP/1.1" status: 200 len: 968 time: 0.0685341
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi [req-1a449846-51bf-46e3-afdb-43469a0226ea 0c7a2d6006614fe2b3e81e47377dd2a9 c26f8d35f85547c4add392a221af1aab - default default] Unexpected exception in API method: NoSuchOptError: no such option enable_consoleauth in group [workarounds]
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi Traceback (most recent call last):
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi   File "/usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py", line 671, in wrapped
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi     return f(*args, **kwargs)
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi   File "/usr/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 110, in wrapper
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi     return func(*args, **kwargs)
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi   File "/usr/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 110, in wrapper
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi     return func(*args, **kwargs)
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi   File "/usr/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 110, in wrapper
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi     return func(*args, **kwargs)
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi   File "/usr/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 110, in wrapper
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi     return func(*args, **kwargs)
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi   File "/usr/lib/python2.7/site-packages/nova/api/openstack/compute/migrate_server.py", line 141, in _migrate_live
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi     async_)
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi   File "/usr/lib/python2.7/site-packages/nova/compute/api.py", line 207, in inner
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi     return function(self, context, instance, *args, **kwargs)
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi   File "/usr/lib/python2.7/site-packages/nova/compute/api.py", line 215, in _wrapped
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi     return fn(self, context, instance, *args, **kwargs)
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi   File "/usr/lib/python2.7/site-packages/nova/compute/api.py", line 155, in inner
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi     return f(self, context, instance, *args, **kw)
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi   File "/usr/lib/python2.7/site-packages/nova/compute/api.py", line 4756, in live_migrate
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi     if CONF.cells.enable or CONF.workarounds.enable_consoleauth
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi   File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 3124, in __getattr__
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi     return self._conf._get(name, self._group)
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi   File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2621, in _get
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi     value, loc = self._do_get(name, group, namespace)
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi   File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2639, in _do_get
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi     info = self._get_opt_info(name, group)
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi   File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2839, in _get_opt_info
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi     raise NoSuchOptError(opt_name, group)
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi NoSuchOptError: no such option enable_consoleauth in group [workarounds]
2020-05-01 19:39:08.140 3225168 ERROR nova.api.openstack.wsgi
2020-05-01 19:39:08.151 3225168 INFO nova.api.openstack.wsgi [req-1a449846-51bf-46e3-afdb-43469a0226ea 0c7a2d6006614fe2b3e81e47377dd2a9 c26f8d35f85547c4add392a221af1aab - default default] HTTP exception thrown: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. <class 'oslo_config.cfg.NoSuchOptError'>
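For the record, the failure mode in that traceback is generic oslo.config behaviour: reading an option that the loaded conf module never registered raises NoSuchOptError. A dependency-free stand-in (hypothetical classes, not the real oslo.config ones) that reproduces the shape of the error:

```python
# Stand-in sketch: nova stein's compute/api.py still reads
# CONF.workarounds.enable_consoleauth, so a hand-copied workarounds.py
# that omits that option makes the attribute lookup fail as in the log.
class NoSuchOptError(Exception):
    def __init__(self, opt_name, group):
        super().__init__(
            "no such option %s in group [%s]" % (opt_name, group))

class OptGroup:
    """Only options registered at construction time can be read."""
    def __init__(self, name, **opts):
        self._name = name
        self._opts = opts

    def __getattr__(self, opt_name):
        # __getattr__ only fires for attributes not found normally,
        # i.e. for options that were never registered.
        if opt_name.startswith("_"):
            raise AttributeError(opt_name)
        try:
            return self._opts[opt_name]
        except KeyError:
            raise NoSuchOptError(opt_name, self._name)

# A workarounds group built from a file that lacks enable_consoleauth:
workarounds = OptGroup("workarounds", force_legacy_port_binding=True)

print(workarounds.force_legacy_port_binding)  # True
try:
    workarounds.enable_consoleauth
except NoSuchOptError as err:
    print(err)  # no such option enable_consoleauth in group [workarounds]
```

This matches Ignazio's later fix: re-adding the enable_consoleauth option definition makes the lookup succeed again.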
I checked nova.conf and did not find enable_consoleauth under the workarounds section.
I'll wait for your checks next week.
Have fun on the long weekend
Ignazio
On Fri, 1 May 2020 at 18:47, Sean Mooney <smooney@redhat.com> wrote:
[snip: quoted exchange identical to the one above]
Hello Sean, I modified the nova workarounds.py to add the consoleauth option, so now it does not return errors during the live migration phase, as I wrote in my last email. Keep in mind my stein is from an upgrade. Sorry if I am not sending the whole email history here, but if the message body is too big the email needs moderator approval. Anyway, I added the following code:
cfg.BoolOpt(
    'enable_consoleauth',
    default=False,
    deprecated_for_removal=True,
    deprecated_since="18.0.0",
    deprecated_reason="""
This option has been added as deprecated originally because it is used
for avoiding a upgrade issue and it will not be used in the future.
See the help text for more details.
""",
    help="""
Enable the consoleauth service to avoid resetting unexpired consoles.

Console token authorizations have moved from the ``nova-consoleauth`` service
to the database, so all new consoles will be supported by the database backend.
With this, consoles that existed before database backend support will be reset.
For most operators, this should be a minimal disruption as the default TTL of a
console token is 10 minutes.

Operators that have much longer token TTL configured or otherwise wish to avoid
immediately resetting all existing consoles can enable this flag to continue
using the ``nova-consoleauth`` service in addition to the database backend.
Once all of the old ``nova-consoleauth`` supported console tokens have expired,
this flag should be disabled. For example, if a deployment has configured a
token TTL of one hour, the operator may disable the flag, one hour after
deploying the new code during an upgrade.

.. note:: Cells v1 was not converted to use the database backend for console
  token authorizations. Cells v1 console token authorizations will continue to
  be supported by the ``nova-consoleauth`` service and use of the
  ``[workarounds]/enable_consoleauth`` option does not apply to Cells v1 users.

Related options:

* ``[consoleauth]/token_ttl``
"""),
Now the live migration starts and the instance is moved, but it continues to be unreachable after the live migration. It starts to respond only when it initiates a connection itself (for example, polling an ntp server). If I disable chrony in the instance, it stops responding forever.
Best Regards
Ignazio
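Incidentally, the chrony observation matches the in-guest workaround mentioned at the start of the thread: any periodic outbound packet lets the network re-learn the VM's MAC when the post-migration RARP frames are lost. A rough sketch of such a keepalive (the gateway address and UDP discard port are placeholders, not values from the thread):

```python
import socket
import time

def keepalive(gateway="192.168.1.1", port=9, interval=5.0, count=3):
    """Send a tiny UDP datagram to the gateway every `interval` seconds.

    The payload is irrelevant: the point is that the guest transmits a
    frame, which repopulates the fabric's MAC tables after a migration.
    Returns the number of datagrams sent.
    """
    sent = 0
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for _ in range(count):
            sock.sendto(b"\x00", (gateway, port))  # port 9 = UDP discard
            sent += 1
            if sent < count:
                time.sleep(interval)
    finally:
        sock.close()
    return sent
```

This is only a mitigation for the symptom; the config-option patch discussed above addresses the underlying port binding behaviour.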
Hello Sean, I am continuing my testing (so you'll have a lot to read :-) ). If I understood correctly, the file neutron.py contains a patch for /usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py that reads the force_legacy_port_binding configuration option; if it is true, the check returns False. I patched api.py and, by inserting a LOG.info call, I saw that it reads the variable, but it seems to change nothing and the migrated instance stops responding. Best Regards Ignazio
On Sat, 2 May 2020 at 10:43, Ignazio Cassano <ignaziocassano@gmail.com> wrote:
Hello Sean, I modified the no workloads.py to add the consoleauth code, so now it does note returns errors during live migration phase, as I wrote in my last email. Keep in mind my stein is from an upgrade. Sorry if I am not sending all email history here, but if message body is too big the email needs the moderator approval. Anycase, I added the following code:
cfg.BoolOpt(
    'enable_consoleauth',
    default=False,
    deprecated_for_removal=True,
    deprecated_since="18.0.0",
    deprecated_reason="""
This option has been added as deprecated originally because it is used
for avoiding a upgrade issue and it will not be used in the future.
See the help text for more details.
""",
    help="""
Enable the consoleauth service to avoid resetting unexpired consoles.

Console token authorizations have moved from the ``nova-consoleauth``
service to the database, so all new consoles will be supported by the
database backend. With this, consoles that existed before database backend
support will be reset. For most operators, this should be a minimal
disruption as the default TTL of a console token is 10 minutes.

Operators that have much longer token TTL configured or otherwise wish to
avoid immediately resetting all existing consoles can enable this flag to
continue using the ``nova-consoleauth`` service in addition to the database
backend. Once all of the old ``nova-consoleauth`` supported console tokens
have expired, this flag should be disabled. For example, if a deployment
has configured a token TTL of one hour, the operator may disable the flag,
one hour after deploying the new code during an upgrade.

.. note:: Cells v1 was not converted to use the database backend for
    console token authorizations. Cells v1 console token authorizations
    will continue to be supported by the ``nova-consoleauth`` service and
    use of the ``[workarounds]/enable_consoleauth`` option does not apply
    to Cells v1 users.

Related options:

- ``[consoleauth]/token_ttl``
"""),
Now the live migration starts and the instance is moved, but it remains unreachable after the live migration. It starts to respond only when it initiates a connection itself (for example, polling an NTP server). If I disable chrony in the instance, it stops responding forever. Best Regards Ignazio
Hello Sean, if you do not want to spend your time configuring a test openstack environment, I am available to schedule a call where I can share my desktop and we could test on rocky and stein. Let me know if you can. Best Regards Ignazio
Hello, first of all sorry for my insistence, but I would like to know whether openstack train is also affected by this bug. Thanks & Regards Ignazio
On Tue, 5 May 2020 at 13:39, Ignazio Cassano <ignaziocassano@gmail.com> wrote:
Hello, I tried to update to the latest stein packages via yum and it seems this bug still exists. Before the yum update I had patched some files as suggested, and ping to the vm worked fine. After the yum update the issue returned. Please let me know if I must patch the files by hand, whether some new configuration parameters can solve it, and/or whether the issue is solved in newer openstack versions. Thanks Ignazio
On Wed, 29 Apr 2020 at 19:49, Sean Mooney <smooney@redhat.com> wrote:
On Wed, 2020-04-29 at 17:10 +0200, Ignazio Cassano wrote:
Many thanks. Please keep in touch.
here are the two patches. the first, https://review.opendev.org/#/c/724386/, is the actual change to add the new config option. this needs a release note and some tests but it should be functional, hence the [WIP]. i have not enabled the workaround in any job in this patch, so the ci run will assert this does not break anything in the default case.
the second patch is https://review.opendev.org/#/c/724387/, which enables the workaround in the multi-node ci jobs and tests that live migration etc. works when the workaround is enabled.
this should work, as it is what we expect to happen if you are using a modern nova with an old neutron. it is marked [DNM] as i don't intend that patch to merge, but if the workaround is useful we might consider enabling it for one of the jobs to get ci coverage, but not all of the jobs.
i have not had time to deploy a 2 node env today but i'll try and test this locally tomorrow.
Ignazio
On Wed, 29 Apr 2020 at 16:55, Sean Mooney <smooney@redhat.com> wrote:
so being pragmatic, i think the simplest path forward, given my other patches have not landed in almost 2 years, is to quickly add a workaround config option to disable multiple port binding, which we can backport, and then we can try and work on the actual fix after. according to https://bugs.launchpad.net/neutron/+bug/1815989 that should serve as a workaround for those that have this issue, but it is a regression in functionality.

i can create a patch that will do that in an hour or so and submit a follow-up DNM patch to enable the workaround in one of the gate jobs that tests live migration. i have a meeting in 10 mins and need to finish the patch i'm currently updating, but i'll submit a PoC once that is done.

i'm not sure if i will be able to spend time on the actual fix which i proposed last year, but i'll see what i can do.
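For an operator, such a [workarounds] option would be a one-line nova.conf change. A minimal sketch of the fragment and how nova-style oslo.config code would read it (the option name follows the force_legacy_port_binding name mentioned in this thread; the name in the final merged patch may differ):

```python
import configparser

# Assumed nova.conf fragment enabling the proposed workaround; the
# option name here is taken from the thread and is not authoritative.
NOVA_CONF = """
[workarounds]
force_legacy_port_binding = True
"""

conf = configparser.ConfigParser()
conf.read_string(NOVA_CONF)

# nova would consult this flag before using the multiple-port-bindings
# live-migration flow.
assert conf.getboolean("workarounds", "force_legacy_port_binding") is True
```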
On Wed, 2020-04-29 at 16:37 +0200, Ignazio Cassano wrote:
PS I have testing environments on queens, rocky and stein and I can run tests as you need. Ignazio
On Wed, 29 Apr 2020 at 16:19, Ignazio Cassano <ignaziocassano@gmail.com> wrote:
Hello Sean, the following is the configuration on my compute nodes:

[root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep libvirt
libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7.x86_64
libvirt-daemon-kvm-4.5.0-33.el7.x86_64
libvirt-libs-4.5.0-33.el7.x86_64
libvirt-daemon-driver-network-4.5.0-33.el7.x86_64
libvirt-daemon-driver-nodedev-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-33.el7.x86_64
libvirt-client-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-core-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-logical-4.5.0-33.el7.x86_64
libvirt-daemon-driver-secret-4.5.0-33.el7.x86_64
libvirt-daemon-4.5.0-33.el7.x86_64
libvirt-daemon-driver-nwfilter-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-scsi-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-rbd-4.5.0-33.el7.x86_64
libvirt-daemon-config-nwfilter-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-disk-4.5.0-33.el7.x86_64
libvirt-bash-completion-4.5.0-33.el7.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-4.5.0-33.el7.x86_64
libvirt-python-4.5.0-1.el7.x86_64
libvirt-daemon-driver-interface-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-mpath-4.5.0-33.el7.x86_64
[root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep qemu
qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64
qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
centos-release-qemu-ev-1.0-4.el7.centos.noarch
ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch
qemu-img-ev-2.12.0-44.1.el7_8.1.x86_64
As for the firewall driver, /etc/neutron/plugins/ml2/openvswitch_agent.ini contains:

firewall_driver = iptables_hybrid

I have the same libvirt/qemu version and the same firewall driver on the queens, rocky and stein testing environments. Live migration on a provider network works fine on queens. It does not work fine on rocky and stein (the vm loses connectivity after it is migrated and starts to respond only when the vm itself sends a network packet, for example when chrony polls the time server).
Ignazio
On Wed, 29 Apr 2020 at 14:36, Sean Mooney <smooney@redhat.com> wrote:
On Wed, 2020-04-29 at 10:39 +0200, Ignazio Cassano wrote:
> Hello, some updates about this issue. I read someone has got the same issue as reported here:
>
> https://bugs.launchpad.net/neutron/+bug/1866139
>
> If you read the discussion, someone says that the garp must be sent by qemu during live migration. If this is true, this means on rocky/stein the qemu/libvirt are bugged.

it is not correct. qemu/libvirt has always used RARP, which predates GARP, as its mac learning frames instead:
https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol
https://lists.gnu.org/archive/html/qemu-devel/2009-10/msg01457.html

however, it looks like this was broken in 2016 in qemu 2.6.0
https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04645.html
but was fixed by
https://github.com/qemu/qemu/commit/ca1ee3d6b546e841a1b9db413eb8fa09f13a061b

can you confirm you are not using the broken 2.6.0 release and are using 2.7 or newer, or 2.4 and older.
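For reference, the RARP announce frame qemu emits after migration (cf. qemu's qemu_announce_self) can be sketched as follows; the field layout follows RFC 903, and this is an illustrative reconstruction, not qemu's actual code:

```python
import struct

# Build a RARP "announce" frame like the ones qemu broadcasts after a
# live migration so switches can relearn the guest's MAC address.
def build_rarp_announce(mac: bytes) -> bytes:
    assert len(mac) == 6
    # Ethernet header: broadcast dst, guest MAC src, RARP ethertype 0x8035.
    eth = b"\xff" * 6 + mac + struct.pack("!H", 0x8035)
    # RARP header: htype=1 (Ethernet), ptype=0x0800 (IPv4),
    # hlen=6, plen=4, op=3 ("request reverse").
    rarp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)
    rarp += mac + b"\x00" * 4   # sender hardware / protocol address
    rarp += mac + b"\x00" * 4   # target hardware / protocol address
    return eth + rarp

frame = build_rarp_announce(bytes.fromhex("525400123456"))
assert frame[12:14] == b"\x80\x35"   # RARP ethertype
assert len(frame) == 14 + 28         # Ethernet header + RARP payload
```

If these frames are dropped (e.g. because libvirt is still replugging the vif), upstream switches keep forwarding traffic to the source host until the guest itself transmits.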
> So I tried to use stein and rocky with the same version of libvirt/qemu packages I installed on queens (I updated the compute and controller nodes on queens to obtain the same libvirt/qemu version deployed on rocky and stein).
>
> On queens, live migration on a provider network continues to work fine. On rocky and stein it does not, so I think the issue is related to the openstack components.
on queens we have only a single port binding, and nova blindly assumes that the port binding details won't change when it does a live migration, so it does not update the xml for the network interfaces. the port binding is updated after the migration is complete, in post_livemigration.

in rocky+, neutron optionally uses the multiple port bindings flow to prebind the port to the destination, so nova can update the xml if needed, and if post-copy live migration is enabled it will asynchronously activate the dest port binding before post_livemigration, shortening the downtime.
if you are using the iptables firewall, os-vif will have precreated the ovs port and intermediate linux bridge before the migration started, which allows neutron to wire it up (put it on the correct vlan and install security groups) before the vm completes the migration.

if you are using the ovs firewall, os-vif still precreates the ovs port, but libvirt deletes it and recreates it too. as a result there is a race when using the openvswitch firewall that can result in the RARP packets being lost.
> Best Regards
> Ignazio Cassano
>
> On Mon, 27 Apr 2020 at 19:50, Sean Mooney <smooney@redhat.com> wrote:
>
> > On Mon, 2020-04-27 at 18:19 +0200, Ignazio Cassano wrote:
> > > Hello, I have this problem with rocky or newer with the iptables_hybrid firewall. So, can I solve it using post-copy live migration?
> >
> > so this behavior has always been how nova worked, but in rocky the
> > https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neu...
> > spec introduced the ability to shorten the outage by pre-binding the port and activating it when the vm is resumed on the destination host, before we get to post live migrate.
> >
> > this reduces the outage time, although it can't be fully eliminated, as some level of packet loss is always expected when you live migrate.
> >
> > so yes, enabling post-copy live migration should help, but be aware that if a network partition happens during a post-copy live migration the vm will crash and need to be restarted. it is generally safe to use and will improve the migration performance, but unlike pre-copy migration, if the guest resumes on the dest and a memory page has not been copied yet, it must wait for it to be copied and retrieve it from the source host. if the connection to the source host is interrupted, the vm can't do that, the migration will fail and the instance will crash. with pre-copy migration, if there is a network partition during the migration, the migration will fail but the instance will continue to run on the source host.
> >
> > so while i would still recommend using it, it is just good to be aware of that behavior change.
Hello All, please, is there any news about bug 1815989? On stein I modified the code as suggested in the patches. I am worried about when I will upgrade to train: will this bug persist? In which openstack version is this bug resolved? Ignazio
On Wed, 18 Nov 2020 at 07:16, Ignazio Cassano <ignaziocassano@gmail.com> wrote:
Hello,
Not sure if you are having the same issue as us, but we are following https://bugs.launchpad.net/neutron/+bug/1901707 and are patching it with something similar to https://review.opendev.org/c/openstack/nova/+/741529 to work around the issue until it's completely solved.
Best regards
From: Ignazio Cassano <ignaziocassano@gmail.com>
Sent: Wednesday, March 10, 2021 7:57:21 AM
To: Sean Mooney
Cc: openstack-discuss; Slawek Kaplonski
Subject: Re: [stein][neutron] gratuitous arp
Hello All, please, are there news about bug 1815989 ? On stein I modified code as suggested in the patches. I am worried when I will upgrade to train: wil this bug persist ? On which openstack version this bug is resolved ? Ignazio
Il giorno mer 18 nov 2020 alle ore 07:16 Ignazio Cassano <ignaziocassano@gmail.commailto:ignaziocassano@gmail.com> ha scritto: Hello, I tried to update to last stein packages on yum and seems this bug still exists. Before the yum update I patched some files as suggested and and ping to vm worked fine. After yum update the issue returns. Please, let me know If I must patch files by hand or some new parameters in configuration can solve and/or the issue is solved in newer openstack versions. Thanks Ignazio
Il Mer 29 Apr 2020, 19:49 Sean Mooney <smooney@redhat.commailto:smooney@redhat.com> ha scritto: On Wed, 2020-04-29 at 17:10 +0200, Ignazio Cassano wrote:
Many thanks. Please keep in touch.
here are the two patches. the first https://review.opendev.org/#/c/724386/ is the actual change to add the new config opition this needs a release note and some tests but it shoudl be functional hence the [WIP] i have not enable the workaround in any job in this patch so the ci run will assert this does not break anything in the default case
the second patch is https://review.opendev.org/#/c/724387/ which enables the workaround in the multi node ci jobs and is testing that live migration exctra works when the workaround is enabled.
this should work as it is what we expect to happen if you are using a moderne nova with an old neutron. its is marked [DNM] as i dont intend that patch to merge but if the workaround is useful we migth consider enableing it for one of the jobs to get ci coverage but not all of the jobs.
i have not had time to deploy a 2 node env today but ill try and test this locally tomorow.
Ignazio
Il giorno mer 29 apr 2020 alle ore 16:55 Sean Mooney <smooney@redhat.commailto:smooney@redhat.com> ha scritto:
so being pragmatic, i think the simplest path forward, given my other patches have not landed in almost 2 years, is to quickly add a workaround config option to disable multiple port binding which we can backport, and then try and work on the actual fix after. according to https://bugs.launchpad.net/neutron/+bug/1815989 that should serve as a workaround for those that have this issue, but it is a regression in functionality.
i can create a patch that will do that in an hour or so and submit a follow-up DNM patch to enable the workaround in one of the gate jobs that tests live migration. i have a meeting in 10 mins and need to finish the patch i'm currently updating, but i'll submit a PoC once that is done.
i'm not sure if i will be able to spend time on the actual fix which i proposed last year, but i'll see what i can do.
On Wed, 2020-04-29 at 16:37 +0200, Ignazio Cassano wrote:
PS: I have testing environments on queens, rocky and stein and I can run whatever tests you need. Ignazio
On Wed, 29 Apr 2020 at 16:19, Ignazio Cassano <ignaziocassano@gmail.com> wrote:
Hello Sean, the following is the configuration on my compute nodes:

[root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep libvirt
libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7.x86_64
libvirt-daemon-kvm-4.5.0-33.el7.x86_64
libvirt-libs-4.5.0-33.el7.x86_64
libvirt-daemon-driver-network-4.5.0-33.el7.x86_64
libvirt-daemon-driver-nodedev-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-33.el7.x86_64
libvirt-client-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-core-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-logical-4.5.0-33.el7.x86_64
libvirt-daemon-driver-secret-4.5.0-33.el7.x86_64
libvirt-daemon-4.5.0-33.el7.x86_64
libvirt-daemon-driver-nwfilter-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-scsi-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-rbd-4.5.0-33.el7.x86_64
libvirt-daemon-config-nwfilter-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-disk-4.5.0-33.el7.x86_64
libvirt-bash-completion-4.5.0-33.el7.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-4.5.0-33.el7.x86_64
libvirt-python-4.5.0-1.el7.x86_64
libvirt-daemon-driver-interface-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-mpath-4.5.0-33.el7.x86_64
[root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep qemu
qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64
qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
centos-release-qemu-ev-1.0-4.el7.centos.noarch
ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch
qemu-img-ev-2.12.0-44.1.el7_8.1.x86_64
As for the firewall driver, /etc/neutron/plugins/ml2/openvswitch_agent.ini has:

firewall_driver = iptables_hybrid

I have the same libvirt/qemu version and the same firewall driver on the queens, rocky and stein testing environments. Live migration on a provider network works fine on queens. It does not work fine on rocky and stein (the vm loses connectivity after it is migrated and starts to respond only when the vm itself sends a network packet, for example when chrony polls the time server).

Ignazio

On Wed, 29 Apr 2020 at 14:36, Sean Mooney <smooney@redhat.com> wrote:
On Wed, 2020-04-29 at 10:39 +0200, Ignazio Cassano wrote:
Hello, some updates about this issue. I read that someone has got the same issue as reported here:

https://bugs.launchpad.net/neutron/+bug/1866139

If you read the discussion, someone says that the GARP must be sent by qemu during live migration. If this is true, it means that on rocky/stein the qemu/libvirt are bugged.
that is not correct. qemu/libvirt has always used RARP, which predates GARP, as its mac learning frames instead:
https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol
https://lists.gnu.org/archive/html/qemu-devel/2009-10/msg01457.html
however it looks like this was broken in 2016 in qemu 2.6.0
https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04645.html
but was fixed by
https://github.com/qemu/qemu/commit/ca1ee3d6b546e841a1b9db413eb8fa09f13a061b
can you confirm you are not using the broken 2.6.0 release, and are using 2.7 or newer, or 2.4 and older?
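A quick way to check whether an installed qemu falls in the version range questioned above (the 2.5/2.6 series, before the RARP self-announce fix) is a small comparison against the version embedded in the rpm name. This is an illustrative sketch, not part of any OpenStack tooling; the helper names are made up for this example.

```python
import re


def version_from_rpm(rpm_name: str) -> str:
    """Extract the upstream version from an rpm name, e.g.
    'qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64' -> '2.12.0'."""
    m = re.search(r"-(\d+\.\d+\.\d+)-", rpm_name)
    if not m:
        raise ValueError(f"no version found in {rpm_name!r}")
    return m.group(1)


def qemu_rarp_suspect(version: str) -> bool:
    """True if this qemu version is in the range where the RARP
    self-announce frames were reported broken (2.5.x / 2.6.x per the
    qemu-devel thread above); 2.4 and older or 2.7+ are fine."""
    major, minor = (int(x) for x in version.split(".")[:2])
    return (major, minor) in ((2, 5), (2, 6))
```

For the packages listed earlier in the thread, `version_from_rpm("qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64")` yields "2.12.0", which is outside the suspect range.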
So I tried to use stein and rocky with the same version of libvirt/qemu packages I installed on queens (I updated the compute and controller nodes on queens to obtain the same libvirt/qemu version deployed on rocky and stein). On queens, live migration on a provider network continues to work fine. On rocky and stein it does not, so I think the issue is related to the openstack components.
on queens we have only a single port binding, and nova blindly assumes that the port binding details won't change when it does a live migration, so it does not update the xml for the network interfaces. the port binding is updated after the migration is complete, in post_livemigration. in rocky+, neutron optionally uses the multiple port bindings flow to prebind the port to the destination so nova can update the xml if needed, and if post copy live migration is enabled it will asynchronously activate the dest port binding before post_livemigration, shortening the downtime.

if you are using the iptables firewall, os-vif will have precreated the ovs port and the intermediate linux bridge before the migration started, which allows neutron to wire it up (put it on the correct vlan and install security groups) before the vm completes the migration.

if you are using the ovs firewall, os-vif still precreates the ovs port, but libvirt deletes it and recreates it. as a result there is a race when using the openvswitch firewall that can result in the RARP packets being lost.
Best Regards, Ignazio Cassano

On Mon, 27 Apr 2020 at 19:50, Sean Mooney <smooney@redhat.com> wrote:

> On Mon, 2020-04-27 at 18:19 +0200, Ignazio Cassano wrote:
> > Hello, I have this problem with rocky or newer with the iptables_hybrid firewall. So, can I solve it using post copy live migration?
>
> so this behavior has always been how nova worked, but rocky's
> https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neu...
> spec introduced the ability to shorten the outage by pre-binding the port and activating it when the vm is resumed on the destination host, before we get to post live migrate.
>
> this reduces the outage time, although it can't be fully eliminated, as some level of packet loss is always expected when you live migrate.
>
> so yes, enabling post copy live migration should help, but be aware that if a network partition happens during a post copy live migration the vm will crash and need to be restarted. it is generally safe to use and will improve the migration performance, but unlike pre copy migration, if the guest resumes on the dest and a memory page has not been copied yet, it must wait for that page to be copied and retrieve it from the source host. if the connection to the source host is interrupted then the vm can't do that, the migration will fail, and the instance will crash. with precopy migration, if there is a network partition during the migration, the migration will fail but the instance will continue to run on the source host.
>
> so while i would still recommend using it, it is just good to be aware of that behavior change.
Hello Tobias, the result is the same as yours. I do not know the internals deeply enough to evaluate whether the behavior is the same. I solved it on stein with the patch suggested by Sean: the force_legacy_port_bind workaround. So I am asking whether the problem also exists on train. Ignazio
On Thu, 11 Mar 2021 at 19:27, Tobias Urdin <tobias.urdin@binero.com> wrote:
Hello,
Not sure if you are having the same issue as us, but we are following https://bugs.launchpad.net/neutron/+bug/1901707 and are patching it with something similar to https://review.opendev.org/c/openstack/nova/+/741529 to work around the issue until it's completely solved.
Best regards
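For reference, the change Tobias links above exposes (as far as I recall from how it later landed upstream) a nova [workarounds] flag that makes nova ask QEMU, via the monitor, to re-announce itself after a live migration so the network fabric relearns the mac address. Treat the option name below as an assumption and check the release notes of your nova version:

```ini
[workarounds]
# assumed option name from the linked change; triggers QEMU's
# announce-self through the monitor after live migration completes
enable_qemu_monitor_announce_self = True
```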
*From:* Ignazio Cassano ignaziocassano@gmail.com *Sent:* Wednesday, March 10, 2021 7:57:21 AM *To:* Sean Mooney *Cc:* openstack-discuss; Slawek Kaplonski *Subject:* Re: [stein][neutron] gratuitous arp
Hello All, please, is there any news about bug 1815989? On stein I modified the code as suggested in the patches. I am worried about when I will upgrade to train: will this bug persist? In which openstack version is this bug resolved? Ignazio
Hello,
If it's the same as us, then yes, the issue occurs on Train and is not completely solved yet.
Best regards
________________________________ From: Ignazio Cassano ignaziocassano@gmail.com Sent: Friday, March 12, 2021 7:43:22 AM To: Tobias Urdin Cc: openstack-discuss Subject: Re: [stein][neutron] gratuitous arp
Hello Tobias, the result is the same as your. I do not know what happens in depth to evaluate if the behavior is the same. I solved on stein with patch suggested by Sean : force_legacy_port_bind workaround. So I am asking if the problem exists also on train. Ignazio
Il Gio 11 Mar 2021, 19:27 Tobias Urdin <tobias.urdin@binero.commailto:tobias.urdin@binero.com> ha scritto:
Hello,
Not sure if you are having the same issue as us, but we are following https://bugs.launchpad.net/neutron/+bug/1901707 but
are patching it with something similar to https://review.opendev.org/c/openstack/nova/+/741529 to workaround the issue until it's completely solved.
Best regards
________________________________ From: Ignazio Cassano <ignaziocassano@gmail.commailto:ignaziocassano@gmail.com> Sent: Wednesday, March 10, 2021 7:57:21 AM To: Sean Mooney Cc: openstack-discuss; Slawek Kaplonski Subject: Re: [stein][neutron] gratuitous arp
Hello All, please, are there news about bug 1815989 ? On stein I modified code as suggested in the patches. I am worried when I will upgrade to train: wil this bug persist ? On which openstack version this bug is resolved ? Ignazio
Il giorno mer 18 nov 2020 alle ore 07:16 Ignazio Cassano <ignaziocassano@gmail.commailto:ignaziocassano@gmail.com> ha scritto: Hello, I tried to update to last stein packages on yum and seems this bug still exists. Before the yum update I patched some files as suggested and and ping to vm worked fine. After yum update the issue returns. Please, let me know If I must patch files by hand or some new parameters in configuration can solve and/or the issue is solved in newer openstack versions. Thanks Ignazio
Il Mer 29 Apr 2020, 19:49 Sean Mooney <smooney@redhat.commailto:smooney@redhat.com> ha scritto: On Wed, 2020-04-29 at 17:10 +0200, Ignazio Cassano wrote:
Many thanks. Please keep in touch.
here are the two patches. the first https://review.opendev.org/#/c/724386/ is the actual change to add the new config opition this needs a release note and some tests but it shoudl be functional hence the [WIP] i have not enable the workaround in any job in this patch so the ci run will assert this does not break anything in the default case
the second patch is https://review.opendev.org/#/c/724387/ which enables the workaround in the multi node ci jobs and is testing that live migration exctra works when the workaround is enabled.
this should work as it is what we expect to happen if you are using a moderne nova with an old neutron. its is marked [DNM] as i dont intend that patch to merge but if the workaround is useful we migth consider enableing it for one of the jobs to get ci coverage but not all of the jobs.
i have not had time to deploy a 2 node env today but ill try and test this locally tomorow.
Ignazio
Il giorno mer 29 apr 2020 alle ore 16:55 Sean Mooney <smooney@redhat.commailto:smooney@redhat.com> ha scritto:
so bing pragmatic i think the simplest path forward given my other patches have not laned in almost 2 years is to quickly add a workaround config option to disable mulitple port bindign which we can backport and then we can try and work on the actual fix after. acording to https://bugs.launchpad.net/neutron/+bug/1815989 that shoudl serve as a workaround for thos that hav this issue but its a regression in functionality.
i can create a patch that will do that in an hour or so and submit a followup DNM patch to enabel the workaound in one of the gate jobs that tests live migration. i have a meeting in 10 mins and need to finish the pacht im currently updating but ill submit a poc once that is done.
im not sure if i will be able to spend time on the actul fix which i proposed last year but ill see what i can do.
On Wed, 2020-04-29 at 16:37 +0200, Ignazio Cassano wrote:
PS I have testing environment on queens,rocky and stein and I can make test as you need. Ignazio
Il giorno mer 29 apr 2020 alle ore 16:19 Ignazio Cassano < ignaziocassano@gmail.commailto:ignaziocassano@gmail.com> ha scritto:
Hello Sean, the following is the configuration on my compute nodes: [root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep libvirt libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7.x86_64 libvirt-daemon-kvm-4.5.0-33.el7.x86_64 libvirt-libs-4.5.0-33.el7.x86_64 libvirt-daemon-driver-network-4.5.0-33.el7.x86_64 libvirt-daemon-driver-nodedev-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-gluster-4.5.0-33.el7.x86_64 libvirt-client-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-core-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-logical-4.5.0-33.el7.x86_64 libvirt-daemon-driver-secret-4.5.0-33.el7.x86_64 libvirt-daemon-4.5.0-33.el7.x86_64 libvirt-daemon-driver-nwfilter-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-scsi-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-rbd-4.5.0-33.el7.x86_64 libvirt-daemon-config-nwfilter-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-disk-4.5.0-33.el7.x86_64 libvirt-bash-completion-4.5.0-33.el7.x86_64 libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-4.5.0-33.el7.x86_64 libvirt-python-4.5.0-1.el7.x86_64 libvirt-daemon-driver-interface-4.5.0-33.el7.x86_64 libvirt-daemon-driver-storage-mpath-4.5.0-33.el7.x86_64 [root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep qemu qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64 qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64 libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64 centos-release-qemu-ev-1.0-4.el7.centos.noarch ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch qemu-img-ev-2.12.0-44.1.el7_8.1.x86_64
As far as firewall driver
/etc/neutron/plugins/ml2/openvswitch_agent.ini:
firewall_driver = iptables_hybrid
I have same libvirt/qemu version on queens, on rocky and on stein
testing
environment and the same firewall driver. Live migration on provider network on queens works fine. It does not work fine on rocky and stein (vm lost connection after it
is
migrated and start to respond only when the vm send a network packet ,
for
example when chrony pools the time server).
Ignazio
Il giorno mer 29 apr 2020 alle ore 14:36 Sean Mooney <
smooney@redhat.commailto:smooney@redhat.com>
ha scritto:
On Wed, 2020-04-29 at 10:39 +0200, Ignazio Cassano wrote:
Hello, some updated about this issue. I read someone has got same issue as reported here:
https://bugs.launchpad.net/neutron/+bug/1866139
If you read the discussion, someone tells that the garp must be
sent by
qemu during live miration. If this is true, this means on rocky/stein the qemu/libvirt are
bugged.
it is not correct. qemu/libvir thas alsway used RARP which predates GARP to serve as
its mac
learning frames instead
https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol
https://lists.gnu.org/archive/html/qemu-devel/2009-10/msg01457.html however it looks like this was broken in 2016 in qemu 2.6.0 https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04645.html but was fixed by
https://github.com/qemu/qemu/commit/ca1ee3d6b546e841a1b9db413eb8fa09f13a061b
can you confirm you are not using the broken 2.6.0 release and are
using
2.7 or newer or 2.4 and older.
So I tried to use stein and rocky with the same version of
libvirt/qemu
packages I installed on queens (I updated compute and controllers
node
on
queens for obtaining same libvirt/qemu version deployed on rocky
and
stein).
On queens live migration on provider network continues to work
fine.
On rocky and stein not, so I think the issue is related to
openstack
components .
on queens we have only a singel prot binding and nova blindly assumes that the port binding details wont change when it does a live migration and does not update the xml for
the
netwrok interfaces.
the port binding is updated after the migration is complete in post_livemigration in rocky+ neutron optionally uses the multiple port bindings flow to prebind the port to the destiatnion so it can update the xml if needed and if post copy live migration is enable it will asyconsly activate teh dest port binding before post_livemigration shortenting the downtime.
if you are using the iptables firewall os-vif will have precreated
the
ovs port and intermediate linux bridge before the migration started which will allow neutron to wire it up (put it on
the
correct vlan and install security groups) before the vm completes the migraton.
if you are using the ovs firewall os-vif still precreates teh ovs
port
but libvirt deletes it and recreats it too. as a result there is a race when using openvswitch firewall that can result in the RARP packets being lost.
Best Regards Ignazio Cassano
Il giorno lun 27 apr 2020 alle ore 19:50 Sean Mooney <
smooney@redhat.commailto:smooney@redhat.com>
ha scritto:
> On Mon, 2020-04-27 at 18:19 +0200, Ignazio Cassano wrote:
> > Hello, I have this problem with rocky or newer with iptables_hybrid firewall.
> > So, can I solve using post copy live migration ???
>
> So this behavior has always been how nova worked, but in Rocky the
> https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neu...
> spec introduced the ability to shorten the outage by pre-binding the port and
> activating it when the VM is resumed on the destination host, before we get to
> post-live-migration.
>
> This reduces the outage time, although it can't be fully eliminated, as some
> level of packet loss is always expected when you live migrate.
>
> So yes, enabling post copy live migration should help, but be aware that if a
> network partition happens during a post copy live migration the VM will crash
> and need to be restarted. It is generally safe to use and will improve the
> migration performance, but unlike pre copy migration, if the guest resumes on
> the dest and a memory page has not been copied yet, it must wait for the page
> to be copied and retrieve it from the source host. If the connection to the
> source host is interrupted, the VM can't do that, the migration will fail and
> the instance will crash. With pre copy migration, if there is a network
> partition during the migration, the migration will fail but the instance will
> continue to run on the source host.
>
> So while I would still recommend using it, it is just good to be aware of that
> behavior change.
>
> > Thanks
> > Ignazio
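For reference, post copy live migration is enabled on the compute nodes through nova.conf's [libvirt] section. A minimal sketch (verify the option name against the configuration reference for your release):

```ini
[libvirt]
# Permit nova/libvirt to switch a running live migration to post-copy mode
live_migration_permit_post_copy = True
```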
On Fri, 2021-03-12 at 08:13 +0000, Tobias Urdin wrote:
Hello,
If it's the same as us, then yes, the issue occurs on Train and is not completely solved yet.
There is a downstream bug tracker for this:
https://bugzilla.redhat.com/show_bug.cgi?id=1917675
It's fixed by a combination of 3 neutron patches and, I think, 1 nova one:
https://review.opendev.org/c/openstack/neutron/+/766277/ https://review.opendev.org/c/openstack/neutron/+/753314/ https://review.opendev.org/c/openstack/neutron/+/640258/
and https://review.opendev.org/c/openstack/nova/+/770745
The first three neutron patches would fix the evacuate case but break live migration; the nova patch means live migration will work too. Although, to fully fix the related live migration packet loss issues you also need
https://review.opendev.org/c/openstack/nova/+/747454/4 and https://review.opendev.org/c/openstack/nova/+/742180/12 to fix live migration with network backends that don't support multiple port bindings, and https://review.opendev.org/c/openstack/nova/+/602432 (the only one not merged yet) for live migration with OVS and hybrid plug=false (e.g. OVS firewall driver, noop, or OVN instead of ml2/ovs).
Multiple port binding was not actually the reason for this: there was a race in neutron itself, between the DHCP agent and the L2 agent, that would have happened even without multiple port bindings.
Some of those patches have been backported already and all should eventually make it to Train; they could potentially be brought to Stein if people are open to backporting/reviewing them.
Best regards
From: Ignazio Cassano ignaziocassano@gmail.com Sent: Friday, March 12, 2021 7:43:22 AM To: Tobias Urdin Cc: openstack-discuss Subject: Re: [stein][neutron] gratuitous arp
Hello Tobias, the result is the same as yours. I do not know what happens in depth to evaluate if the behavior is the same. I solved it on Stein with the patch suggested by Sean: the force_legacy_port_bind workaround. So I am asking if the problem also exists on Train. Ignazio
On Thu 11 Mar 2021, 19:27, Tobias Urdin <tobias.urdin@binero.com> wrote:
Hello,
Not sure if you are having the same issue as us, but we are following https://bugs.launchpad.net/neutron/+bug/1901707 and are patching it with something similar to https://review.opendev.org/c/openstack/nova/+/741529 to work around the issue until it's completely solved.
Best regards
From: Ignazio Cassano <ignaziocassano@gmail.com> Sent: Wednesday, March 10, 2021 7:57:21 AM To: Sean Mooney Cc: openstack-discuss; Slawek Kaplonski Subject: Re: [stein][neutron] gratuitous arp
Hello All, please, is there any news about bug 1815989? On Stein I modified the code as suggested in the patches. I am worried about when I will upgrade to Train: will this bug persist? On which OpenStack version is this bug resolved? Ignazio
On Wed 18 Nov 2020 at 07:16, Ignazio Cassano <ignaziocassano@gmail.com> wrote: Hello, I tried to update to the latest Stein packages via yum and it seems this bug still exists. Before the yum update I patched some files as suggested, and ping to the VM worked fine. After the yum update the issue returned. Please let me know if I must patch the files by hand, or whether some new configuration parameters can solve it, and/or whether the issue is solved in newer OpenStack versions. Thanks Ignazio
On Wed 29 Apr 2020, 19:49, Sean Mooney <smooney@redhat.com> wrote: On Wed, 2020-04-29 at 17:10 +0200, Ignazio Cassano wrote:
Many thanks. Please keep in touch.
Here are the two patches. The first, https://review.opendev.org/#/c/724386/, is the actual change to add the new config option. It needs a release note and some tests but it should be functional, hence the [WIP]. I have not enabled the workaround in any job in this patch, so the CI run will assert that it does not break anything in the default case.
The second patch is https://review.opendev.org/#/c/724387/, which enables the workaround in the multi-node CI jobs and tests that live migration actually works when the workaround is enabled.
This should work, as it is what we expect to happen if you are using a modern nova with an old neutron. It is marked [DNM] as I don't intend that patch to merge, but if the workaround is useful we might consider enabling it for one of the jobs, though not all of them, to get CI coverage.
I have not had time to deploy a 2-node env today, but I'll try to test this locally tomorrow.
Ignazio
On Wed 29 Apr 2020 at 16:55, Sean Mooney <smooney@redhat.com> wrote:
So, being pragmatic, I think the simplest path forward, given my other patches have not landed in almost 2 years, is to quickly add a workaround config option to disable multiple port binding, which we can backport, and then try to work on the actual fix after. According to https://bugs.launchpad.net/neutron/+bug/1815989 that should serve as a workaround for those that have this issue, but it's a regression in functionality.
I can create a patch that will do that in an hour or so and submit a follow-up DNM patch to enable the workaround in one of the gate jobs that tests live migration. I have a meeting in 10 mins and need to finish the patch I'm currently updating, but I'll submit a PoC once that is done.
I'm not sure if I will be able to spend time on the actual fix which I proposed last year, but I'll see what I can do.
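In nova.conf terms, the proposed workaround would look something like the sketch below. The option name here is an assumption based on what the thread later calls the "force_legacy_port_bind" workaround; check the patch itself for the real name before using it:

```ini
[workarounds]
# Hypothetical option name -- verify against https://review.opendev.org/#/c/724386/
# before relying on it. Disables the multiple port bindings flow so live
# migration falls back to the legacy single-binding behavior.
force_legacy_port_bind = true
```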
On Wed, 2020-04-29 at 16:37 +0200, Ignazio Cassano wrote:
PS: I have testing environments on Queens, Rocky and Stein, and I can run tests as you need. Ignazio
On Wed 29 Apr 2020 at 16:19, Ignazio Cassano <ignaziocassano@gmail.com> wrote:
Hello Sean, the following is the configuration on my compute nodes:

[root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep libvirt
libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7.x86_64
libvirt-daemon-kvm-4.5.0-33.el7.x86_64
libvirt-libs-4.5.0-33.el7.x86_64
libvirt-daemon-driver-network-4.5.0-33.el7.x86_64
libvirt-daemon-driver-nodedev-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-33.el7.x86_64
libvirt-client-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-core-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-logical-4.5.0-33.el7.x86_64
libvirt-daemon-driver-secret-4.5.0-33.el7.x86_64
libvirt-daemon-4.5.0-33.el7.x86_64
libvirt-daemon-driver-nwfilter-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-scsi-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-rbd-4.5.0-33.el7.x86_64
libvirt-daemon-config-nwfilter-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-disk-4.5.0-33.el7.x86_64
libvirt-bash-completion-4.5.0-33.el7.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-4.5.0-33.el7.x86_64
libvirt-python-4.5.0-1.el7.x86_64
libvirt-daemon-driver-interface-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-mpath-4.5.0-33.el7.x86_64

[root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep qemu
qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64
qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
centos-release-qemu-ev-1.0-4.el7.centos.noarch
ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch
qemu-img-ev-2.12.0-44.1.el7_8.1.x86_64
As far as firewall driver
/etc/neutron/plugins/ml2/openvswitch_agent.ini:
firewall_driver = iptables_hybrid
I have the same libvirt/qemu version on the Queens, Rocky and Stein testing environments, and the same firewall driver. Live migration on a provider network works fine on Queens. It does not work fine on Rocky and Stein: the VM loses connectivity after it is migrated and starts responding again only when it sends a network packet itself, for example when chrony polls the time server.
Ignazio
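That symptom (connectivity returning only once the VM transmits) is what a stale MAC learning table produces: the switch keeps forwarding toward the old port until any frame from the new port, whether a RARP announce or the VM's own traffic, updates its table. A toy sketch of that behavior (a hypothetical class for illustration only, with made-up MAC and host names):

```python
class LearningSwitch:
    """Toy forwarding database (FDB) illustrating why a migrated VM stays
    unreachable until it transmits: traffic follows the stale entry until
    a frame from the new port (RARP, ping reply, chrony poll) relearns it."""

    def __init__(self):
        self.fdb = {}  # MAC address -> port

    def learn(self, src_mac, port):
        # Called for every frame received: source MAC maps to ingress port
        self.fdb[src_mac] = port

    def lookup(self, dst_mac):
        return self.fdb.get(dst_mac)

sw = LearningSwitch()
sw.learn("fa:16:3e:00:00:01", "compute1")   # before migration
# VM live-migrates to compute2, but its RARP frames are lost:
assert sw.lookup("fa:16:3e:00:00:01") == "compute1"   # stale -> blackholed
sw.learn("fa:16:3e:00:00:01", "compute2")   # first outgoing packet from the VM
assert sw.lookup("fa:16:3e:00:00:01") == "compute2"   # reachable again
```

This is also why the script pinging the gateway from inside the VM "fixes" the problem: it keeps generating outgoing frames from the new location.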
On Wed 29 Apr 2020 at 14:36, Sean Mooney <smooney@redhat.com> wrote:
On Wed, 2020-04-29 at 10:39 +0200, Ignazio Cassano wrote:
> Hello, some updates about this issue.
> I read someone has got the same issue as reported here:
>
> https://bugs.launchpad.net/neutron/+bug/1866139
>
> If you read the discussion, someone says that the GARP must be sent by
> qemu during live migration.
> If this is true, it means that on Rocky/Stein qemu/libvirt are bugged.

That is not correct. qemu/libvirt has always used RARP, which predates GARP, as its MAC learning frames instead:
https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol
https://lists.gnu.org/archive/html/qemu-devel/2009-10/msg01457.html
However, it looks like this was broken in 2016 in qemu 2.6.0
https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04645.html
but was fixed by
https://github.com/qemu/qemu/commit/ca1ee3d6b546e841a1b9db413eb8fa09f13a061b
Can you confirm you are not using the broken 2.6.0 release, and are using 2.7 or newer, or 2.4 and older?
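A quick way to sanity-check a compute node against that range is to compare the installed qemu version. The sketch below hard-codes a stand-in value; in practice you would substitute the version reported by `qemu-system-x86_64 --version` or `rpm -q qemu-kvm-ev`, and note it assumes (per the advice above) that the 2.5/2.6 series is the affected window:

```shell
# Stand-in for the locally installed qemu version (hypothetical value);
# replace with the output parsed from `qemu-system-x86_64 --version`.
ver="2.12.0"

# 2.4 and older, or 2.7 and newer, are fine; 2.5.x/2.6.x carry the
# RARP regression described above.
case "$ver" in
  2.5.*|2.6.*) echo "affected" ;;
  *)           echo "ok" ;;
esac
```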
> So I tried to use Stein and Rocky with the same version of the libvirt/qemu
> packages I installed on Queens (I updated the compute and controller nodes on
> Queens to obtain the same libvirt/qemu version deployed on Rocky and Stein).
>
> On Queens, live migration on a provider network continues to work fine.
> On Rocky and Stein it does not, so I think the issue is related to OpenStack
> components.

On Queens we have only a single port binding, and nova blindly assumes that the port binding details won't change when it does a live migration, so it does not update the XML for the network interfaces. The port binding is updated after the migration is complete, in post_livemigration.
In Rocky+, neutron optionally uses the multiple port bindings flow to prebind the port to the destination so it can update the XML if needed, and if post copy live migration is enabled it will asynchronously activate the dest port binding before post_livemigration, shortening the downtime.
If you are using the iptables firewall, os-vif will have precreated the OVS port and intermediate linux bridge before the migration started, which will allow neutron to wire it up (put it on the correct vlan and install security groups) before the VM completes the migration.
If you are using the OVS firewall, os-vif still precreates the OVS port, but libvirt deletes it and recreates it. As a result there is a race when using the openvswitch firewall that can result in the RARP packets being lost.
Many thanks for your explanation, Sean. Ignazio
Hello Sean, I am testing OpenStack live migration on CentOS 7 Train and live migration breaks again: live-migrated instances stop responding to ping requests. I did not understand whether I must apply the patches you suggested in your last email to me, and also the following: https://review.opendev.org/c/openstack/nova/+/741529
Il giorno ven 12 mar 2021 alle ore 23:44 Sean Mooney smooney@redhat.com ha scritto:
On Fri, 2021-03-12 at 08:13 +0000, Tobias Urdin wrote:
Hello,
If it's the same as us, then yes, the issue occurs on Train and is not
completely solved yet. there is a downstream bug trackker for this
https://bugzilla.redhat.com/show_bug.cgi?id=1917675
its fixed by a combination of 3 enturon patches and i think 1 nova one
https://review.opendev.org/c/openstack/neutron/+/766277/ https://review.opendev.org/c/openstack/neutron/+/753314/ https://review.opendev.org/c/openstack/neutron/+/640258/
and https://review.opendev.org/c/openstack/nova/+/770745
the first tree neutron patches would fix the evauate case but break live migration the nova patch means live migration will work too although to fully fix the related live migration packet loss issues you need
https://review.opendev.org/c/openstack/nova/+/747454/4 https://review.opendev.org/c/openstack/nova/+/742180/12 to fix live migration with network abckend that dont suppor tmultiple port binding and https://review.opendev.org/c/openstack/nova/+/602432 (the only one not merged yet.) for live migrateon with ovs and hybridg plug=false (e.g. ovs firewall driver, noop or ovn instead of ml2/ovs.
multiple port binding was not actully the reason for this there was a race in neutorn itslef that would have haapend even without multiple port binding between the dhcp agent and l2 agent.
some of those patches have been backported already and all shoudl eventually make ti to train the could be brought to stine potentially if peopel are open to backport/review them.
Best regards
From: Ignazio Cassano ignaziocassano@gmail.com Sent: Friday, March 12, 2021 7:43:22 AM To: Tobias Urdin Cc: openstack-discuss Subject: Re: [stein][neutron] gratuitous arp
Hello Tobias, the result is the same as yours. I do not know what happens in depth, to evaluate whether the behavior is the same.
I solved it on stein with the patch suggested by Sean: the force_legacy_port_bind workaround.
So I am asking whether the problem also exists on train. Ignazio
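For anyone landing on this thread: the workaround Ignazio refers to is a nova-side config flag that disables the multiple-port-binding flow during live migration. A minimal sketch of what that looks like, assuming the option landed under nova's `[workarounds]` section with the name used in this thread; verify the exact option name against the release notes of your nova version, since the patch went through review after these mails:

```ini
# /etc/nova/nova.conf on compute nodes.
# Option name as used in this thread; confirm against your nova
# release notes before relying on it.
[workarounds]
force_legacy_port_bind = True
```

With this set, nova falls back to the single (pre-rocky) port-binding behaviour, trading the shorter-downtime prebinding flow for the queens-era behaviour that the thread reports as working.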
On Thu, 11 Mar 2021 at 19:27, Tobias Urdin <tobias.urdin@binero.com> wrote:
Hello,
Not sure if you are having the same issue as us, but we are following
https://bugs.launchpad.net/neutron/+bug/1901707 and
are patching it with something similar to
https://review.opendev.org/c/openstack/nova/+/741529 to work around the issue until it's completely solved.
Best regards
From: Ignazio Cassano <ignaziocassano@gmail.com>
Sent: Wednesday, March 10, 2021 7:57:21 AM To: Sean Mooney Cc: openstack-discuss; Slawek Kaplonski Subject: Re: [stein][neutron] gratuitous arp
Hello All, please, is there any news about bug 1815989 ? On stein I modified the code as suggested in the patches. I am worried about when I will upgrade to train: will this bug persist ? On which openstack version is this bug resolved ? Ignazio
On Wed, 18 Nov 2020 at 07:16, Ignazio Cassano <ignaziocassano@gmail.com> wrote:
Hello, I tried to update to the latest stein packages via yum, and it seems this bug still exists.
Before the yum update I patched some files as suggested, and ping to the vm worked fine.
After the yum update the issue returned. Please let me know whether I must patch the files by hand, whether some new configuration parameters can solve it, and/or whether the issue is solved in newer openstack versions.
Thanks Ignazio
On Wed, 29 Apr 2020 at 19:49, Sean Mooney <smooney@redhat.com> wrote:
On Wed, 2020-04-29 at 17:10 +0200, Ignazio Cassano wrote:
Many thanks. Please keep in touch.
here are the two patches.
the first, https://review.opendev.org/#/c/724386/, is the actual change to add the new config option.
this needs a release note and some tests, but it should be functional, hence the [WIP].
i have not enabled the workaround in any job in this patch, so the ci run will assert this does not break anything in the default case.
the second patch is https://review.opendev.org/#/c/724387/, which enables the workaround in the multi node ci jobs and tests that live migration actually works when the workaround is enabled.
this should work, as it is what we expect to happen if you are using a modern nova with an old neutron.
it is marked [DNM] as i don't intend that patch to merge, but if the workaround is useful we might consider enabling it for one of the jobs to get ci coverage, though not all of the jobs.
i have not had time to deploy a 2 node env today, but i'll try and test this locally tomorrow.
Ignazio
On Wed, 29 Apr 2020 at 16:55, Sean Mooney <smooney@redhat.com> wrote:
so, being pragmatic, i think the simplest path forward, given my other patches have not landed in almost 2 years, is to quickly add a workaround config option to disable multiple port binding, which we can backport, and then we can try to work on the actual fix after.
according to https://bugs.launchpad.net/neutron/+bug/1815989 that should serve as a workaround for those that have this issue, but it's a regression in functionality.
i can create a patch that will do that in an hour or so, and submit a followup DNM patch to enable the workaround in one of the gate jobs that tests live migration. i have a meeting in 10 mins and need to finish the patch i'm currently updating, but i'll submit a poc once that is done.
i'm not sure if i will be able to spend time on the actual fix which i proposed last year, but i'll see what i can do.
On Wed, 2020-04-29 at 16:37 +0200, Ignazio Cassano wrote:
PS: I have testing environments on queens, rocky and stein, and I can run whatever tests you need. Ignazio
On Wed, 29 Apr 2020 at 16:19, Ignazio Cassano <ignaziocassano@gmail.com> wrote:
Hello Sean, the following is the configuration on my compute nodes:

[root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep libvirt
libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7.x86_64
libvirt-daemon-kvm-4.5.0-33.el7.x86_64
libvirt-libs-4.5.0-33.el7.x86_64
libvirt-daemon-driver-network-4.5.0-33.el7.x86_64
libvirt-daemon-driver-nodedev-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-33.el7.x86_64
libvirt-client-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-core-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-logical-4.5.0-33.el7.x86_64
libvirt-daemon-driver-secret-4.5.0-33.el7.x86_64
libvirt-daemon-4.5.0-33.el7.x86_64
libvirt-daemon-driver-nwfilter-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-scsi-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-rbd-4.5.0-33.el7.x86_64
libvirt-daemon-config-nwfilter-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-disk-4.5.0-33.el7.x86_64
libvirt-bash-completion-4.5.0-33.el7.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-4.5.0-33.el7.x86_64
libvirt-python-4.5.0-1.el7.x86_64
libvirt-daemon-driver-interface-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-mpath-4.5.0-33.el7.x86_64

[root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep qemu
qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64
qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
centos-release-qemu-ev-1.0-4.el7.centos.noarch
ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch
qemu-img-ev-2.12.0-44.1.el7_8.1.x86_64
As for the firewall driver, /etc/neutron/plugins/ml2/openvswitch_agent.ini has:
firewall_driver = iptables_hybrid
I have the same libvirt/qemu versions in the queens, rocky and stein testing environments, and the same firewall driver.
Live migration on a provider network works fine on queens. It does not work fine on rocky and stein: the vm loses connectivity after it is migrated and starts to respond only when the vm itself sends a network packet, for example when chrony polls the time server.
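As a side note on the qemu versions listed above: further down the thread, Sean points out a qemu regression where RARP self-announce was broken in 2.6.0 and fixed again upstream before 2.7. A small sketch to sanity-check a version string against that range (my own helper, not part of any openstack tooling; the exact affected point releases are per the linked qemu commits, so treat this as approximate):

```python
def qemu_rarp_announce_broken(version: str) -> bool:
    """True if the qemu version falls in the 2.6.x range that the thread
    reports as having broken RARP self-announce (broken in 2.6.0, fixed
    upstream before 2.7 -- approximate, per the linked qemu commits)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) == (2, 6)

# The environment above runs qemu-kvm-ev 2.12.0, outside the broken range.
print(qemu_rarp_announce_broken("2.12.0"))  # -> False
print(qemu_rarp_announce_broken("2.6.0"))   # -> True
```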
Ignazio
On Wed, 29 Apr 2020 at 14:36, Sean Mooney <smooney@redhat.com> wrote:
> On Wed, 2020-04-29 at 10:39 +0200, Ignazio Cassano wrote:
> > Hello, some updates about this issue. I read someone has got the same issue as reported here:
> >
> > https://bugs.launchpad.net/neutron/+bug/1866139
> >
> > If you read the discussion, someone says that the garp must be sent by qemu during live migration.
> > If this is true, it means the qemu/libvirt on rocky/stein are bugged.
>
> it is not correct. qemu/libvirt has always used RARP, which predates GARP, as its mac learning frames instead:
> https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol
> https://lists.gnu.org/archive/html/qemu-devel/2009-10/msg01457.html
> however it looks like this was broken in 2016 in qemu 2.6.0
> https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04645.html
> but was fixed by
> https://github.com/qemu/qemu/commit/ca1ee3d6b546e841a1b9db413eb8fa09f13a061b
> can you confirm you are not using the broken 2.6.0 release, and are using 2.7 or newer, or 2.4 and older.
>
> > So I tried to use stein and rocky with the same version of libvirt/qemu packages I installed on queens (I updated the compute and controller nodes on queens to obtain the same libvirt/qemu version deployed on rocky and stein).
> >
> > On queens, live migration on a provider network continues to work fine.
> > On rocky and stein it does not, so I think the issue is related to openstack components.
>
> on queens we have only a single port binding, and nova blindly assumes that the port binding details won't change when it does a live migration, so it does not update the xml for the network interfaces; the port binding is updated after the migration is complete, in post_livemigration.
> in rocky+, neutron optionally uses the multiple port bindings flow to prebind the port to the destination, so nova can update the xml if needed, and if post copy live migration is enabled it will asynchronously activate the dest port binding before post_livemigration, shortening the downtime.
>
> if you are using the iptables firewall, os-vif will have precreated the ovs port and the intermediate linux bridge before the migration started, which allows neutron to wire it up (put it on the correct vlan and install security groups) before the vm completes the migration.
>
> if you are using the ovs firewall, os-vif still precreates the ovs port, but libvirt deletes it and recreates it. as a result there is a race when using the openvswitch firewall that can result in the RARP packets being lost.
>
> > On Mon, 27 Apr 2020 at 19:50, Sean Mooney <smooney@redhat.com> wrote:
> > > On Mon, 2020-04-27 at 18:19 +0200, Ignazio Cassano wrote:
> > > > Hello, I have this problem with rocky or newer with the iptables_hybrid firewall. So, can I solve it using post copy live migration ?
> > >
> > > so this behavior has always been how nova worked, but in rocky the
> > > https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neu...
> > > spec introduced the ability to shorten the outage by pre binding the port and activating it when the vm is resumed on the destination host, before we get to post live migrate.
> > >
> > > this reduces the outage time, although it can't be fully eliminated, as some level of packet loss is always expected when you live migrate.
> > >
> > > so yes, enabling post copy live migration should help, but be aware that if a network partition happens during a post copy live migration the vm will crash and need to be restarted.
> > > it is generally safe to use and will improve the migration performance, but unlike pre copy migration, if the guest resumes on the dest and a memory page has not been copied yet, then it must wait for it to be copied and retrieve it from the source host. if the connection to the source host is interrupted, the vm can't do that, and the migration will fail and the instance will crash. with precopy migration, if there is a network partition during the migration, the migration will fail but the instance will continue to run on the source host.
> > >
> > > so while i would still recommend using it, it is just good to be aware of that behavior change.
> > >
> > > > Thanks
> > > > Ignazio
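To make the RARP discussion above concrete, here is a rough sketch of the kind of frame qemu broadcasts after migration so that switches re-learn the guest MAC. It is hand-built from the RFC 903 field layout, not taken from qemu's source, so treat the exact field values as illustrative:

```python
import struct

def build_rarp_announce(guest_mac: bytes) -> bytes:
    """Sketch of a RARP 'request reverse' frame (RFC 903), like the mac
    learning frames qemu emits after live migration. Field choices are
    illustrative, not copied from qemu."""
    assert len(guest_mac) == 6
    # Ethernet header: broadcast destination, guest MAC source, EtherType 0x8035 (RARP)
    eth = b"\xff" * 6 + guest_mac + struct.pack("!H", 0x8035)
    # ARP-style payload: htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4,
    # oper=3 (request reverse); sender/target hardware address = guest MAC, IPs zeroed
    payload = (
        struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)
        + guest_mac + b"\x00" * 4
        + guest_mac + b"\x00" * 4
    )
    return eth + payload

frame = build_rarp_announce(bytes.fromhex("525400123456"))
print(len(frame))  # -> 42 (14-byte Ethernet header + 28-byte RARP payload)
```

Because the frame carries the guest MAC as its source address, any switch on the path re-learns which port the MAC now lives behind; if these frames are dropped in the race described above, traffic keeps flowing to the old host until the guest itself transmits (hence the "ping from inside the vm fixes it" symptom).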
Hello, I am trying to apply the suggested patches, but after applying some of them, python errors do not allow the neutron services to start. Since my python skill is poor, I wonder if those patches are for python3. I am on centos 7, and (probably???) the patches are for centos 8. I also wonder if it is possible to upgrade a centos 7 train to centos 8 train without reinstalling all nodes. This could be important for the next release upgrades (ussuri, victoria and so on). Ignazio
participants (3)
- Ignazio Cassano
- Sean Mooney
- Tobias Urdin