<div dir="auto">Hello, I tried updating to the latest Stein packages with yum, and it seems this bug still exists. <div dir="auto">Before the yum update I patched some files as suggested, and pinging the VM worked fine.</div><div dir="auto">After the yum update the issue returned.</div><div dir="auto">Please let me know whether I must patch the files by hand, whether some new configuration parameter can solve it, and/or whether the issue is fixed in newer OpenStack versions.</div><div dir="auto">Thanks </div><div dir="auto">Ignazio</div><div dir="auto"><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 29 Apr 2020, 19:49 Sean Mooney <<a href="mailto:smooney@redhat.com">smooney@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Wed, 2020-04-29 at 17:10 +0200, Ignazio Cassano wrote:<br>
> Many thanks.<br>
> Please keep in touch.<br>
here are the two patches.<br>
the first, <a href="https://review.opendev.org/#/c/724386/" rel="noreferrer noreferrer" target="_blank">https://review.opendev.org/#/c/724386/</a>, is the actual change to add the new config option.<br>
this needs a release note and some tests but it should be functional, hence the [WIP].<br>
i have not enabled the workaround in any job in this patch, so the ci run will assert that this does not break<br>
anything in the default case.<br>
<br>
the second patch is <a href="https://review.opendev.org/#/c/724387/" rel="noreferrer noreferrer" target="_blank">https://review.opendev.org/#/c/724387/</a>, which enables the workaround in the multi node ci jobs<br>
and is testing that live migration etc. works when the workaround is enabled.<br>
<br>
this should work as it is what we expect to happen if you are using a modern nova with an old neutron.<br>
it is marked [DNM] as i don't intend that patch to merge, but if the workaround is useful we might consider enabling<br>
it for one of the jobs to get ci coverage, but not all of the jobs.<br>
<br>
i have not had time to deploy a 2 node env today but i'll try and test this locally tomorrow.<br>
<br>
<br>
<br>
> Ignazio<br>
> <br>
> On Wed, 29 Apr 2020 at 16:55, Sean Mooney <<a href="mailto:smooney@redhat.com" target="_blank" rel="noreferrer">smooney@redhat.com</a>><br>
> wrote:<br>
> <br>
> > so being pragmatic, i think the simplest path forward, given my other patches<br>
> > have not landed<br>
> > in almost 2 years, is to quickly add a workaround config option to disable<br>
> > multiple port binding,<br>
> > which we can backport, and then we can try and work on the actual fix after.<br>
> > according to <a href="https://bugs.launchpad.net/neutron/+bug/1815989" rel="noreferrer noreferrer" target="_blank">https://bugs.launchpad.net/neutron/+bug/1815989</a> that should<br>
> > serve as a workaround<br>
> > for those that have this issue, but it is a regression in functionality.<br>
> > <br>
> > i can create a patch that will do that in an hour or so and submit a<br>
> > followup DNM patch to enable the<br>
> > workaround in one of the gate jobs that tests live migration.<br>
> > i have a meeting in 10 mins and need to finish the patch i'm currently<br>
> > updating, but i'll submit a poc once that is done.<br>
> > <br>
> > i'm not sure if i will be able to spend time on the actual fix which i<br>
> > proposed last year, but i'll see what i can do.<br>
> > <br>
> > <br>
> > On Wed, 2020-04-29 at 16:37 +0200, Ignazio Cassano wrote:<br>
> > > PS<br>
> > > I have testing environments on queens, rocky and stein, and I can run tests<br>
> > > as you need.<br>
> > > Ignazio<br>
> > > <br>
> > > On Wed, 29 Apr 2020 at 16:19, Ignazio Cassano <<br>
> > > <a href="mailto:ignaziocassano@gmail.com" target="_blank" rel="noreferrer">ignaziocassano@gmail.com</a>> wrote:<br>
> > > <br>
> > > > Hello Sean,<br>
> > > > the following is the configuration on my compute nodes:<br>
> > > > [root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep libvirt<br>
> > > > libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7.x86_64<br>
> > > > libvirt-daemon-kvm-4.5.0-33.el7.x86_64<br>
> > > > libvirt-libs-4.5.0-33.el7.x86_64<br>
> > > > libvirt-daemon-driver-network-4.5.0-33.el7.x86_64<br>
> > > > libvirt-daemon-driver-nodedev-4.5.0-33.el7.x86_64<br>
> > > > libvirt-daemon-driver-storage-gluster-4.5.0-33.el7.x86_64<br>
> > > > libvirt-client-4.5.0-33.el7.x86_64<br>
> > > > libvirt-daemon-driver-storage-core-4.5.0-33.el7.x86_64<br>
> > > > libvirt-daemon-driver-storage-logical-4.5.0-33.el7.x86_64<br>
> > > > libvirt-daemon-driver-secret-4.5.0-33.el7.x86_64<br>
> > > > libvirt-daemon-4.5.0-33.el7.x86_64<br>
> > > > libvirt-daemon-driver-nwfilter-4.5.0-33.el7.x86_64<br>
> > > > libvirt-daemon-driver-storage-scsi-4.5.0-33.el7.x86_64<br>
> > > > libvirt-daemon-driver-storage-rbd-4.5.0-33.el7.x86_64<br>
> > > > libvirt-daemon-config-nwfilter-4.5.0-33.el7.x86_64<br>
> > > > libvirt-daemon-driver-storage-disk-4.5.0-33.el7.x86_64<br>
> > > > libvirt-bash-completion-4.5.0-33.el7.x86_64<br>
> > > > libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64<br>
> > > > libvirt-daemon-driver-storage-4.5.0-33.el7.x86_64<br>
> > > > libvirt-python-4.5.0-1.el7.x86_64<br>
> > > > libvirt-daemon-driver-interface-4.5.0-33.el7.x86_64<br>
> > > > libvirt-daemon-driver-storage-mpath-4.5.0-33.el7.x86_64<br>
> > > > [root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep qemu<br>
> > > > qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64<br>
> > > > qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64<br>
> > > > libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64<br>
> > > > centos-release-qemu-ev-1.0-4.el7.centos.noarch<br>
> > > > ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch<br>
> > > > qemu-img-ev-2.12.0-44.1.el7_8.1.x86_64<br>
> > > > <br>
> > > > <br>
> > > > As for the firewall driver, in /etc/neutron/plugins/ml2/openvswitch_agent.ini:<br>
> > > > <br>
> > > > firewall_driver = iptables_hybrid<br>
> > > > <br>
> > > > I have the same libvirt/qemu version on the queens, rocky and stein testing<br>
> > > > environments, and the same firewall driver.<br>
> > > > Live migration on a provider network works fine on queens.<br>
> > > > It does not work properly on rocky and stein (the vm loses connectivity after it is<br>
> > > > migrated and starts to respond only when the vm sends a network packet, for<br>
> > > > example when chrony polls the time server).<br>
> > > > <br>
> > > > Ignazio<br>
> > > > <br>
> > > > <br>
> > > > <br>
> > > > On Wed, 29 Apr 2020 at 14:36, Sean Mooney <<a href="mailto:smooney@redhat.com" target="_blank" rel="noreferrer">smooney@redhat.com</a>><br>
> > > > wrote:<br>
> > > > <br>
> > > > > On Wed, 2020-04-29 at 10:39 +0200, Ignazio Cassano wrote:<br>
> > > > > > Hello, some updates about this issue.<br>
> > > > > > I read that someone has the same issue, as reported here:<br>
> > > > > > <br>
> > > > > > <a href="https://bugs.launchpad.net/neutron/+bug/1866139" rel="noreferrer noreferrer" target="_blank">https://bugs.launchpad.net/neutron/+bug/1866139</a><br>
> > > > > > <br>
> > > > > > If you read the discussion, someone says that the garp must be sent by<br>
> > > > > > qemu during live migration.<br>
> > > > > > If this is true, it means that on rocky/stein qemu/libvirt are bugged.<br>
> > > > > <br>
> > > > > that is not correct.<br>
> > > > > qemu/libvirt has always used RARP, which predates GARP, to serve as its mac<br>
> > > > > learning frames instead:<br>
> > > > > <a href="https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol" rel="noreferrer noreferrer" target="_blank">https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol</a><br>
> > > > > <a href="https://lists.gnu.org/archive/html/qemu-devel/2009-10/msg01457.html" rel="noreferrer noreferrer" target="_blank">https://lists.gnu.org/archive/html/qemu-devel/2009-10/msg01457.html</a><br>
> > > > > however it looks like this was broken in 2016 in qemu 2.6.0<br>
> > > > > <a href="https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04645.html" rel="noreferrer noreferrer" target="_blank">https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04645.html</a><br>
> > > > > but was fixed by<br>
> > > > > <a href="https://github.com/qemu/qemu/commit/ca1ee3d6b546e841a1b9db413eb8fa09f13a061b" rel="noreferrer noreferrer" target="_blank">https://github.com/qemu/qemu/commit/ca1ee3d6b546e841a1b9db413eb8fa09f13a061b</a><br>
> > > > > can you confirm you are not using the broken 2.6.0 release and are using<br>
> > > > > 2.7 or newer or 2.4 and older.<br>
> > > > > <br>
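The version question above can be checked mechanically against the `rpm -qa` output posted earlier in the thread. The following is a minimal sketch, not an official tool: the function names are hypothetical, and treating only the 2.6 series as affected is an assumption taken from the statement that 2.6.0 was the broken release.<br>

```python
import re

# Assumption: only the 2.6.x series is treated as affected, based on the
# statement in this thread that qemu 2.6.0 broke the RARP announcement.
AFFECTED_SERIES = {(2, 6)}


def parse_qemu_version(package_name):
    """Extract (major, minor, patch) from a package string such as
    'qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", package_name)
    if match is None:
        raise ValueError(f"no version found in {package_name!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(package_name):
    """Return True if the package belongs to a series assumed to be broken."""
    major, minor, _patch = parse_qemu_version(package_name)
    return (major, minor) in AFFECTED_SERIES


if __name__ == "__main__":
    # Package name taken from the rpm listing in this thread.
    print(is_affected("qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64"))
```

For the qemu-kvm-ev 2.12.0 package listed in this thread the check reports that the host is not in the affected series, which matches the conclusion that qemu itself is not the culprit here.<br>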
> > > > > <br>
> > > > > > So I tried stein and rocky with the same versions of the libvirt/qemu<br>
> > > > > > packages that I installed on queens (I updated the compute and controller nodes on<br>
> > > > > > queens to obtain the same libvirt/qemu versions deployed on rocky and stein).<br>
> > > > > > <br>
> > > > > > On queens, live migration on a provider network continues to work fine.<br>
> > > > > > On rocky and stein it does not, so I think the issue is related to openstack<br>
> > > > > > components.<br>
> > > > > <br>
> > > > > on queens we have only a single port binding, and nova blindly assumes<br>
> > > > > that the port binding details won't<br>
> > > > > change when it does a live migration and does not update the xml for the<br>
> > > > > network interfaces.<br>
> > > > > <br>
> > > > > the port binding is updated after the migration is complete in<br>
> > > > > post_livemigration.<br>
> > > > > in rocky+ neutron optionally uses the multiple port bindings flow to<br>
> > > > > prebind the port to the destination<br>
> > > > > so it can update the xml if needed, and if post copy live migration is<br>
> > > > > enabled it will asynchronously activate the dest port<br>
> > > > > binding before post_livemigration, shortening the downtime.<br>
> > > > > <br>
> > > > > if you are using the iptables firewall, os-vif will have precreated the<br>
> > > > > ovs port and intermediate linux bridge before the<br>
> > > > > migration started, which will allow neutron to wire it up (put it on the<br>
> > > > > correct vlan and install security groups) before<br>
> > > > > the vm completes the migration.<br>
> > > > > <br>
> > > > > if you are using the ovs firewall, os-vif still precreates the ovs port<br>
> > > > > but libvirt deletes it and recreates it too.<br>
> > > > > as a result there is a race when using the openvswitch firewall that can<br>
> > > > > result in the RARP packets being lost.<br>
> > > > > <br>
> > > > > > <br>
> > > > > > Best Regards<br>
> > > > > > Ignazio Cassano<br>
> > > > > > <br>
> > > > > > <br>
> > > > > > <br>
> > > > > > <br>
> > > > > > On Mon, 27 Apr 2020 at 19:50, Sean Mooney <<a href="mailto:smooney@redhat.com" target="_blank" rel="noreferrer">smooney@redhat.com</a>><br>
> > > > > > wrote:<br>
> > > > > > <br>
> > > > > > > On Mon, 2020-04-27 at 18:19 +0200, Ignazio Cassano wrote:<br>
> > > > > > > > Hello, I have this problem with rocky or newer with the iptables_hybrid<br>
> > > > > > > > firewall.<br>
> > > > > > > > So, can I solve it by using post copy live migration?<br>
> > > > > > > <br>
> > > > > > > so this behavior has always been how nova worked, but in rocky the<br>
> > > > > > > <a href="https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neutron-new-port-binding-api.html" rel="noreferrer noreferrer" target="_blank">https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neutron-new-port-binding-api.html</a><br>
> > > > > > > spec introduced the ability to shorten the outage by pre-binding the port and<br>
> > > > > > > activating it when<br>
> > > > > > > the vm is resumed on the destination host before we get to post live migrate.<br>
> > > > > > > <br>
> > > > > > > this reduces the outage time, although it can't be fully eliminated as some<br>
> > > > > > > level of packet loss is<br>
> > > > > > > always expected when you live migrate.<br>
> > > > > > > <br>
> > > > > > > so yes, enabling post copy live migration should help, but be aware that if a<br>
> > > > > > > network partition happens<br>
> > > > > > > during a post copy live migration the vm will crash and need to be<br>
> > > > > > > restarted.<br>
> > > > > > > it is generally safe to use and will improve the migration performance, but<br>
> > > > > > > unlike pre copy migration, if<br>
> > > > > > > the guest resumes on the dest and a memory page has not been copied yet<br>
> > > > > > > then it must wait for it to be copied<br>
> > > > > > > and retrieve it from the source host. if the connection to the source host<br>
> > > > > > > is interrupted then the vm can't<br>
> > > > > > > do that, and the migration will fail and the instance will crash. if you<br>
> > > > > > > are using precopy migration<br>
> > > > > > > and there is a network partition during the migration, the migration will<br>
> > > > > > > fail but the instance will continue<br>
> > > > > > > to run on the source host.<br>
> > > > > > > <br>
> > > > > > > so while i would still recommend using it, it is just good to be aware of<br>
> > > > > > > that behavior change.<br>
> > > > > > > <br>
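To see whether a compute node already permits post copy, one can inspect its nova.conf. The sketch below assumes the switch is the `live_migration_permit_post_copy` option in the `[libvirt]` section and that it defaults to off when absent; the function name and the sample file contents are hypothetical, and the host's libvirt/qemu and kernel must also support post copy for the setting to have any effect.<br>

```python
import configparser

# Hypothetical nova.conf contents, used only for illustration.
SAMPLE_NOVA_CONF = """
[DEFAULT]
debug = false

[libvirt]
live_migration_permit_post_copy = True
"""


def post_copy_permitted(nova_conf_text):
    """Return True if the given nova.conf text permits post copy migration.

    Assumption: a missing section or option means the feature is off.
    """
    # strict=False tolerates repeated options; interpolation=None keeps
    # literal '%' characters in other options from raising errors.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(nova_conf_text)
    return parser.getboolean(
        "libvirt", "live_migration_permit_post_copy", fallback=False
    )


if __name__ == "__main__":
    print(post_copy_permitted(SAMPLE_NOVA_CONF))
```

On a real host the text would come from reading /etc/nova/nova.conf on each compute node, and nova-compute needs a restart after the option is changed.<br>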
> > > > > > > > Thanks<br>
> > > > > > > > Ignazio<br>
> > > > > > > > <br>
> > > > > > > > On Mon, 27 Apr 2020, 17:57 Sean Mooney <<a href="mailto:smooney@redhat.com" target="_blank" rel="noreferrer">smooney@redhat.com</a>> wrote:<br>
> > > > > > > > <br>
> > > > > > > > > On Mon, 2020-04-27 at 17:06 +0200, Ignazio Cassano wrote:<br>
> > > > > > > > > > Hello, I have a problem on stein neutron. When a vm migrates from one node<br>
> > > > > > > > > > to another I cannot ping it for several minutes. If I put a script in the vm<br>
> > > > > > > > > > that pings the gateway continuously, the live migration works fine and<br>
> > > > > > > > > > I can ping it. Why does this happen? I read something about gratuitous arp.<br>
> > > > > > > > > <br>
> > > > > > > > > qemu does not use gratuitous arp but instead uses an older protocol called<br>
> > > > > > > > > RARP<br>
> > > > > > > > > to do mac address learning.<br>
> > > > > > > > > <br>
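For readers unfamiliar with RARP, the sketch below builds a broadcast RARP frame of the kind described above, using the generic packet layout from RFC 903. It is an illustration only: the function name is hypothetical, and the assumption that qemu's announcement matches this layout field for field has not been verified against the qemu source.<br>

```python
import struct

ETH_P_RARP = 0x8035    # EtherType assigned to RARP
RARP_OP_REQUEST = 3    # "request reverse" opcode from RFC 903


def build_rarp_announce(mac):
    """Build a broadcast RARP frame announcing `mac` (6 raw bytes)."""
    if len(mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    broadcast = b"\xff" * 6
    # Ethernet header: destination, source, EtherType.
    eth_header = broadcast + mac + struct.pack("!H", ETH_P_RARP)
    rarp_body = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                # hardware type: Ethernet
        0x0800,           # protocol type: IPv4
        6,                # hardware address length
        4,                # protocol address length
        RARP_OP_REQUEST,
        mac,              # sender hardware address
        b"\x00" * 4,      # sender protocol address (unknown)
        mac,              # target hardware address
        b"\x00" * 4,      # target protocol address (unknown)
    )
    # Pad to the 60-byte minimum Ethernet frame size (excluding the FCS).
    return (eth_header + rarp_body).ljust(60, b"\x00")


if __name__ == "__main__":
    # Example MAC using the fa:16:3e prefix; the value itself is made up.
    frame = build_rarp_announce(bytes.fromhex("fa163e000001"))
    print(frame.hex())
```

Because the source MAC of the broadcast frame is the vm's own address, every switch that sees the frame relearns which port the vm now sits behind. If the frame is dropped before the port is wired up, the switches keep the stale entry until the vm sends traffic on its own, which is consistent with the symptom reported in this thread.<br>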
> > > > > > > > > what release of openstack are you using, and are you using the iptables<br>
> > > > > > > > > firewall or the openvswitch firewall?<br>
> > > > > > > > > <br>
> > > > > > > > > if you are using openvswitch there is nothing we can do until we<br>
> > > > > > > > > finally delegate vif plugging to os-vif.<br>
> > > > > > > > > currently libvirt handles interface plugging for kernel ovs when using the<br>
> > > > > > > > > openvswitch firewall driver.<br>
> > > > > > > > > <a href="https://review.opendev.org/#/c/602432/" rel="noreferrer noreferrer" target="_blank">https://review.opendev.org/#/c/602432/</a> would address that, but it and the<br>
> > > > > > > > > neutron patch<br>
> > > > > > > > > <a href="https://review.opendev.org/#/c/640258" rel="noreferrer noreferrer" target="_blank">https://review.opendev.org/#/c/640258</a> are rather outdated. while libvirt is<br>
> > > > > > > > > plugging the vif there will always be<br>
> > > > > > > > > a race condition where the RARP packets sent by qemu and then the mac learning<br>
> > > > > > > > > packets will be lost.<br>
> > > > > > > > > <br>
> > > > > > > > > if you are using the iptables firewall and you have openstack rocky or<br>
> > > > > > > > > later, then if you enable post copy live migration<br>
> > > > > > > > > it should reduce the downtime. in this configuration we do not have the race<br>
> > > > > > > > > between neutron and libvirt, so the rarp<br>
> > > > > > > > > packets should not be lost.<br>
> > > > > > > > > <br>
> > > > > > > > > <br>
> > > > > > > > > > Could you please help me?<br>
> > > > > > > > > > Is there any workaround?<br>
> > > > > > > > > > <br>
> > > > > > > > > > Best Regards<br>
> > > > > > > > > > Ignazio<br>
> > > > > > > > > <br>
> > > > > > > > > <br>
> > > > > > > <br>
> > > > > > > <br>
> > > > > <br>
> > > > > <br>
> > <br>
> > <br>
<br>
</blockquote></div>