Hello Sean, I am testing the openstack migration on centos 7 train and live migration fails again: live-migrated instances stop responding to ping requests. I do not understand whether I must apply the patches you suggested in your last email to me and also the following: https://review.opendev.org/c/openstack/nova/+/741529
On Fri 12 Mar 2021 at 23:44, Sean Mooney <smooney@redhat.com> wrote:
On Fri, 2021-03-12 at 08:13 +0000, Tobias Urdin wrote:
> Hello,
> If it's the same as us, then yes, the issue occurs on Train and is not completely solved yet.

there is a downstream bug tracker for this
https://bugzilla.redhat.com/show_bug.cgi?id=1917675
it's fixed by a combination of 3 neutron patches and, i think, 1 nova one
https://review.opendev.org/c/openstack/neutron/+/766277/
https://review.opendev.org/c/openstack/neutron/+/753314/
https://review.opendev.org/c/openstack/neutron/+/640258/
and https://review.opendev.org/c/openstack/nova/+/770745
the first three neutron patches would fix the evacuate case but break live migration. the nova patch means live migration will work too, although to fully fix the related live migration packet loss issues you need
https://review.opendev.org/c/openstack/nova/+/747454/4
https://review.opendev.org/c/openstack/nova/+/742180/12
to fix live migration with network backends that don't support multiple port binding, and
https://review.opendev.org/c/openstack/nova/+/602432 (the only one not merged yet)
for live migration with ovs and hybrid plug=false (e.g. ovs firewall driver, noop, or ovn instead of ml2/ovs).
multiple port binding was not actually the reason for this. there was a race in neutron itself, between the dhcp agent and the l2 agent, that would have happened even without multiple port binding.
some of those patches have been backported already and all should eventually make it to train. they could potentially be brought to stein if people are open to backporting/reviewing them.
Best regards
________________________________
From: Ignazio Cassano <ignaziocassano@gmail.com>
Sent: Friday, March 12, 2021 7:43:22 AM
To: Tobias Urdin
Cc: openstack-discuss
Subject: Re: [stein][neutron] gratuitous arp
Hello Tobias, the result is the same as yours. I do not know in depth what happens, so I cannot evaluate whether the behavior is the same.
I solved it on stein with the patch suggested by Sean: the force_legacy_port_bind workaround. So I am asking whether the problem also exists on train. Ignazio
On Thu 11 Mar 2021, 19:27 Tobias Urdin <tobias.urdin@binero.com> wrote:
Hello,
Not sure if you are having the same issue as us, but we are following https://bugs.launchpad.net/neutron/+bug/1901707 and
are patching it with something similar to https://review.opendev.org/c/openstack/nova/+/741529 to work around the issue until it's completely solved.
Best regards
________________________________
From: Ignazio Cassano <ignaziocassano@gmail.com>
Sent: Wednesday, March 10, 2021 7:57:21 AM
To: Sean Mooney
Cc: openstack-discuss; Slawek Kaplonski
Subject: Re: [stein][neutron] gratuitous arp
Hello All, please, is there any news about bug 1815989? On stein I modified the code as suggested in the patches. I am worried about upgrading to train: will this bug persist? In which openstack version is this bug resolved? Ignazio
On Wed 18 Nov 2020 at 07:16, Ignazio Cassano <ignaziocassano@gmail.com> wrote:
Hello, I tried to update to the latest stein packages with yum and it seems this bug still exists. Before the yum update I patched some files as suggested and ping to the vm worked fine. After the yum update the issue returned. Please let me know whether I must patch files by hand, whether some new configuration parameters can solve it, and/or whether the issue is solved in newer openstack versions. Thanks Ignazio
Many thanks. Please keep in touch. here are the two patches.
On Wed 29 Apr 2020, 19:49 Sean Mooney <smooney@redhat.com> wrote:
On Wed, 2020-04-29 at 17:10 +0200, Ignazio Cassano wrote:
the first, https://review.opendev.org/#/c/724386/, is the actual change to add the new config option. this needs a release note and some tests but it should be functional, hence the [WIP]. i have not enabled the workaround in any job in this patch, so the ci run will assert this does not break anything in the default case.
the second patch is https://review.opendev.org/#/c/724387/ which enables the workaround in the multi node ci jobs and tests that live migration etc. works when the workaround is enabled.
this should work as it is what we expect to happen if you are using a modern nova with an old neutron. it is marked [DNM] as i don't intend that patch to merge, but if the workaround is useful we might consider enabling it for one of the jobs to get ci coverage, but not all of the jobs.
i have not had time to deploy a 2 node env today but i'll try and test this locally tomorrow.
Ignazio
On Wed 29 Apr 2020 at 16:55, Sean Mooney <smooney@redhat.com> wrote:
so being pragmatic, i think the simplest path forward, given my other patches have not landed in almost 2 years, is to quickly add a workaround config option to disable multiple port binding which we can backport, and then we can try and work on the actual fix after. according to https://bugs.launchpad.net/neutron/+bug/1815989 that should serve as a workaround for those that have this issue, but it's a regression in functionality.
i can create a patch that will do that in an hour or so and submit a followup DNM patch to enable the workaround in one of the gate jobs that tests live migration. i have a meeting in 10 mins and need to finish the patch i'm currently updating but i'll submit a poc once that is done.
i'm not sure if i will be able to spend time on the actual fix which i proposed last year but i'll see what i can do.
On Wed, 2020-04-29 at 16:37 +0200, Ignazio Cassano wrote:
PS I have testing environments on queens, rocky and stein and I can run tests as you need. Ignazio
On Wed 29 Apr 2020 at 16:19, Ignazio Cassano <ignaziocassano@gmail.com> wrote:
Hello Sean, the following is the configuration on my compute nodes:
[root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep libvirt
libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7.x86_64
libvirt-daemon-kvm-4.5.0-33.el7.x86_64
libvirt-libs-4.5.0-33.el7.x86_64
libvirt-daemon-driver-network-4.5.0-33.el7.x86_64
libvirt-daemon-driver-nodedev-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-33.el7.x86_64
libvirt-client-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-core-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-logical-4.5.0-33.el7.x86_64
libvirt-daemon-driver-secret-4.5.0-33.el7.x86_64
libvirt-daemon-4.5.0-33.el7.x86_64
libvirt-daemon-driver-nwfilter-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-scsi-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-rbd-4.5.0-33.el7.x86_64
libvirt-daemon-config-nwfilter-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-disk-4.5.0-33.el7.x86_64
libvirt-bash-completion-4.5.0-33.el7.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-4.5.0-33.el7.x86_64
libvirt-python-4.5.0-1.el7.x86_64
libvirt-daemon-driver-interface-4.5.0-33.el7.x86_64
libvirt-daemon-driver-storage-mpath-4.5.0-33.el7.x86_64
[root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep qemu
qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64
qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
centos-release-qemu-ev-1.0-4.el7.centos.noarch
ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch
qemu-img-ev-2.12.0-44.1.el7_8.1.x86_64
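A package listing like the one above can be checked mechanically against the QEMU release that the thread reports as having lost its RARP announce (2.6.0). The following is only a rough sketch: the helper names are hypothetical, the regular expression assumes the usual name-version-release RPM naming, and treating the whole 2.6.x series as affected is an assumption. The thread only says 2.7 or newer and 2.4 or older are fine, so 2.5 is reported as unconfirmed.

```python
import re

def qemu_version(rpm_name):
    """Extract (major, minor, patch) from an RPM name such as
    'qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64'. Returns None if not found."""
    match = re.search(r"-(\d+)\.(\d+)\.(\d+)-", rpm_name)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())

def rarp_announce_status(version):
    """Classify a QEMU version using only what the thread states."""
    if version[:2] == (2, 6):
        return "broken"       # assumption: the whole 2.6.x series is affected
    if version[:2] == (2, 5):
        return "unconfirmed"  # the thread does not mention 2.5
    return "ok"               # 2.7 or newer, or 2.4 and older

version = qemu_version("qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64")
print(version, rarp_announce_status(version))
```

For the qemu-kvm-ev package listed above this reports a version outside the affected range.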
As for the firewall driver, in /etc/neutron/plugins/ml2/openvswitch_agent.ini:
firewall_driver = iptables_hybrid
I have the same libvirt/qemu version on the queens, rocky and stein testing environments, and the same firewall driver. Live migration on a provider network works fine on queens. It does not work on rocky and stein (the vm loses connectivity after it is migrated and starts to respond only when the vm sends a network packet, for example when chrony polls the time server).
Ignazio
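Since the thread distinguishes two cases by firewall driver, a small script can read the setting and report which case applies. This is a minimal sketch, not part of nova or neutron: `describe_plugging` is a hypothetical helper, the `[securitygroup]` section name is an assumption (the message above shows only the key and value), and the returned descriptions simply paraphrase what is said elsewhere in this thread.

```python
import configparser

# Sample content mirroring the setting quoted above. The section name is an
# assumption; only the key and value appear in the thread.
SAMPLE_INI = """
[securitygroup]
firewall_driver = iptables_hybrid
"""

def describe_plugging(ini_text):
    """Report which live-migration plugging case a firewall_driver implies."""
    # interpolation=None and strict=False tolerate '%' characters and
    # repeated options, which can occur in hand-edited service configs.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    driver = parser.get("securitygroup", "firewall_driver", fallback=None)
    if driver == "iptables_hybrid":
        return "hybrid plug: os-vif precreates the ovs port and linux bridge"
    if driver == "openvswitch":
        return "libvirt plug: port is recreated, RARP packets may be lost"
    return "unknown"

print(describe_plugging(SAMPLE_INI))
```

On a real compute node the text would come from reading the agent configuration file instead of `SAMPLE_INI`.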
On Wed 29 Apr 2020 at 14:36, Sean Mooney <smooney@redhat.com> wrote:
> On Wed, 2020-04-29 at 10:39 +0200, Ignazio Cassano wrote:
> > Hello, some updates about this issue.
> > I read that someone has the same issue, as reported here:
> >
> > https://bugs.launchpad.net/neutron/+bug/1866139
> >
> > If you read the discussion, someone says that the garp must be sent by
> > qemu during live migration.
> > If this is true, it means that on rocky/stein qemu/libvirt are bugged.
>
> that is not correct.
> qemu/libvirt has always used RARP, which predates GARP, to serve as its mac
> learning frames instead:
> https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol
> https://lists.gnu.org/archive/html/qemu-devel/2009-10/msg01457.html
> however it looks like this was broken in 2016 in qemu 2.6.0
> https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04645.html
> but was fixed by
> https://github.com/qemu/qemu/commit/ca1ee3d6b546e841a1b9db413eb8fa09f13a061b
> can you confirm you are not using the broken 2.6.0 release and are using
> 2.7 or newer or 2.4 and older.
>
> > So I tried to use stein and rocky with the same version of the libvirt/qemu
> > packages I installed on queens (I updated the compute and controller nodes
> > on queens to obtain the same libvirt/qemu version deployed on rocky and
> > stein).
> >
> > On queens live migration on a provider network continues to work fine.
> > On rocky and stein it does not, so I think the issue is related to openstack
> > components.
>
> on queens we have only a single port binding and nova blindly assumes
> that the port binding details won't change when it does a live migration,
> and does not update the xml for the network interfaces.
>
> the port binding is updated after the migration is complete in
> post_livemigration.
> in rocky+ neutron optionally uses the multiple port bindings flow to
> prebind the port to the destination so it can update the xml if needed,
> and if post copy live migration is enabled it will asynchronously activate
> the dest port binding before post_livemigration, shortening the downtime.
>
> if you are using the iptables firewall os-vif will have precreated the
> ovs port and intermediate linux bridge before the migration started, which
> will allow neutron to wire it up (put it on the correct vlan and install
> security groups) before the vm completes the migration.
>
> if you are using the ovs firewall os-vif still precreates the ovs port
> but libvirt deletes it and recreates it too.
> as a result there is a race when using the openvswitch firewall that can
> result in the RARP packets being lost.
>
> > Best Regards
> > Ignazio Cassano
> >
> > On Mon 27 Apr 2020 at 19:50, Sean Mooney <smooney@redhat.com> wrote:
> >
> > > On Mon, 2020-04-27 at 18:19 +0200, Ignazio Cassano wrote:
> > > > Hello, I have this problem with rocky or newer with the iptables_hybrid
> > > > firewall.
> > > > So, can I solve it using post copy live migration?
> > >
> > > so this behavior has always been how nova worked, but in rocky the
> > > https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neu...
> > > spec introduced the ability to shorten the outage by pre-binding the port and
> > > activating it when the vm is resumed on the destination host, before we get to
> > > post live migrate.
> > >
> > > this reduces the outage time, although it can't be fully eliminated as some
> > > level of packet loss is always expected when you live migrate.
> > >
> > > so yes, enabling post copy live migration should help, but be aware that if a
> > > network partition happens during a post copy live migration the vm will crash
> > > and need to be restarted.
> > > it is generally safe to use and will improve the migration performance, but
> > > unlike pre copy migration, if the guest resumes on the dest and the memory page
> > > has not been copied yet then it must wait for it to be copied and retrieve it
> > > from the source host. if the connection to the source host is interrupted then
> > > the vm can't do that and the migration will fail and the instance will crash.
> > > if you are using precopy migration and there is a network partition during the
> > > migration, the migration will fail but the instance will continue to run on the
> > > source host.
> > >
> > > so while i would still recommend using it, it is just good to be aware of
> > > that behavior change.
> > >
> > > > Thanks
> > > > Ignazio
> > > >
> > > > On Mon 27 Apr 2020, 17:57 Sean Mooney <smooney@redhat.com> wrote:
> > > >
> > > > > On Mon, 2020-04-27 at 17:06 +0200, Ignazio Cassano wrote:
> > > > > > Hello, I have a problem on stein neutron. When a vm migrates from one
> > > > > > node to another I cannot ping it for several minutes. If I put a script
> > > > > > in the vm that pings the gateway continuously, the live migration works
> > > > > > fine and I can ping it. Why does this happen? I read something about
> > > > > > gratuitous arp.
> > > > >
> > > > > qemu does not use gratuitous arp but instead uses an older protocol called
> > > > > RARP to do mac address learning.
> > > > >
> > > > > what release of openstack are you using, and are you using the iptables
> > > > > firewall or the openvswitch firewall?
> > > > >
> > > > > if you are using openvswitch there is nothing we can do until we
> > > > > finally delegate vif plugging to os-vif.
> > > > > currently libvirt handles interface plugging for kernel ovs when using the
> > > > > openvswitch firewall driver.
> > > > > https://review.opendev.org/#/c/602432/ would address that, but it and the
> > > > > neutron patch https://review.opendev.org/#/c/640258 are rather outdated.
> > > > > while libvirt is plugging the vif there will always be a race condition
> > > > > where the RARP packets sent by qemu, i.e. the mac learning packets, will
> > > > > be lost.
> > > > >
> > > > > if you are using the iptables firewall and you have openstack rocky or
> > > > > later, then enabling post copy live migration should reduce the downtime.
> > > > > in this configuration we do not have the race between neutron and libvirt,
> > > > > so the rarp packets should not be lost.
> > > > >
> > > > > > Please, help me?
> > > > > > Any workaround, please?
> > > > > >
> > > > > > Best Regards
> > > > > > Ignazio