[stein][neutron] gratuitous arp

Ignazio Cassano ignaziocassano at gmail.com
Fri Mar 12 06:43:22 UTC 2021


Hello Tobias, the result is the same as yours.
I have not dug into it deeply enough to tell whether the underlying behavior is
the same.
I solved it on Stein with the patch suggested by Sean: the force_legacy_port_bind
workaround.
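For reference, this is roughly what I applied in nova.conf on the compute nodes
(only a sketch; the option name here follows the patch discussed in this thread,
so check the exact name against the version you backport):

    [workarounds]
    # name as referenced in this thread; may differ in the final patch
    force_legacy_port_bind = true

and then I restarted nova-compute on the affected nodes.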
So I am asking whether the problem also exists on Train.
Ignazio

On Thu, Mar 11, 2021 at 19:27 Tobias Urdin <tobias.urdin at binero.com> wrote:

> Hello,
>
>
> Not sure if you are having the same issue as us, but we are following
> https://bugs.launchpad.net/neutron/+bug/1901707 and
> are patching it with something similar to
> https://review.opendev.org/c/openstack/nova/+/741529 to work around the
> issue until it's completely solved.
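> (For anyone else hitting this: if that review is what you apply, the workaround
> is toggled through a nova.conf option on the compute nodes along these lines;
> the option name below is taken from the linked review, so verify it against the
> backport you actually carry:
>
>     [workarounds]
>     # assumed name from review 741529
>     enable_qemu_monitor_announce_self = true
> )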
>
>
> Best regards
>
> ------------------------------
> From: Ignazio Cassano <ignaziocassano at gmail.com>
> Sent: Wednesday, March 10, 2021 7:57:21 AM
> To: Sean Mooney
> Cc: openstack-discuss; Slawek Kaplonski
> Subject: Re: [stein][neutron] gratuitous arp
>
> Hello All,
> please, is there any news about bug 1815989?
> On Stein I modified the code as suggested in the patches.
> I am worried about upgrading to Train: will this bug persist?
> In which OpenStack version is this bug resolved?
> Ignazio
>
>
>
> On Wed, Nov 18, 2020 at 07:16 Ignazio Cassano <ignaziocassano at gmail.com> wrote:
>
>> Hello, I tried to update to the latest Stein packages via yum and it seems this
>> bug still exists.
>> Before the yum update I had patched some files as suggested, and ping to the vm
>> worked fine.
>> After the yum update the issue returned.
>> Please let me know whether I must patch the files by hand, whether some new
>> configuration parameters can solve it, or whether the issue is solved in newer
>> OpenStack versions.
>> Thanks
>> Ignazio
>>
>>
>> On Wed, Apr 29, 2020 at 19:49 Sean Mooney <smooney at redhat.com> wrote:
>>
>>> On Wed, 2020-04-29 at 17:10 +0200, Ignazio Cassano wrote:
>>> > Many thanks.
>>> > Please keep in touch.
>>> here are the two patches.
>>> the first, https://review.opendev.org/#/c/724386/, is the actual change
>>> to add the new config option.
>>> it needs a release note and some tests but it should be functional,
>>> hence the [WIP].
>>> i have not enabled the workaround in any job in this patch, so the ci run
>>> will assert that this does not break
>>> anything in the default case.
>>>
>>> the second patch is https://review.opendev.org/#/c/724387/ which
>>> enables the workaround in the multi node ci jobs
>>> and tests that live migration still works when the workaround is
>>> enabled.
>>>
>>> this should work, as it is what we expect to happen if you are using a
>>> modern nova with an old neutron.
>>> it is marked [DNM] as i don't intend that patch to merge, but if the
>>> workaround is useful we might consider enabling
>>> it for one of the jobs to get ci coverage, but not for all of the jobs.
>>>
>>> i have not had time to deploy a 2 node env today but i'll try to test
>>> this locally tomorrow.
>>>
>>>
>>>
>>> > Ignazio
>>> >
>>> > On Wed, Apr 29, 2020 at 16:55 Sean Mooney <smooney at redhat.com> wrote:
>>> >
>>> > > so being pragmatic, i think the simplest path forward, given my other
>>> > > patches have not landed
>>> > > in almost 2 years, is to quickly add a workaround config option to
>>> > > disable multiple port binding
>>> > > which we can backport, and then we can try to work on the actual fix
>>> > > after.
>>> > > according to https://bugs.launchpad.net/neutron/+bug/1815989 that should
>>> > > serve as a workaround
>>> > > for those that have this issue, but it's a regression in functionality.
>>> > >
>>> > > i can create a patch that will do that in an hour or so and submit a
>>> > > followup DNM patch to enable the
>>> > > workaround in one of the gate jobs that tests live migration.
>>> > > i have a meeting in 10 mins and need to finish the patch i'm currently
>>> > > updating, but i'll submit a poc once that is done.
>>> > >
>>> > > i'm not sure if i will be able to spend time on the actual fix which i
>>> > > proposed last year, but i'll see what i can do.
>>> > >
>>> > >
>>> > > On Wed, 2020-04-29 at 16:37 +0200, Ignazio Cassano wrote:
>>> > > > PS
>>> > > > I have testing environments on Queens, Rocky and Stein and I can run
>>> > > > tests as you need.
>>> > > > Ignazio
>>> > > >
>>> > > > On Wed, Apr 29, 2020 at 16:19 Ignazio Cassano <ignaziocassano at gmail.com> wrote:
>>> > > >
>>> > > > > Hello Sean,
>>> > > > > the following is the configuration on my compute nodes:
>>> > > > > [root at podiscsivc-kvm01 network-scripts]# rpm -qa|grep libvirt
>>> > > > > libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7.x86_64
>>> > > > > libvirt-daemon-kvm-4.5.0-33.el7.x86_64
>>> > > > > libvirt-libs-4.5.0-33.el7.x86_64
>>> > > > > libvirt-daemon-driver-network-4.5.0-33.el7.x86_64
>>> > > > > libvirt-daemon-driver-nodedev-4.5.0-33.el7.x86_64
>>> > > > > libvirt-daemon-driver-storage-gluster-4.5.0-33.el7.x86_64
>>> > > > > libvirt-client-4.5.0-33.el7.x86_64
>>> > > > > libvirt-daemon-driver-storage-core-4.5.0-33.el7.x86_64
>>> > > > > libvirt-daemon-driver-storage-logical-4.5.0-33.el7.x86_64
>>> > > > > libvirt-daemon-driver-secret-4.5.0-33.el7.x86_64
>>> > > > > libvirt-daemon-4.5.0-33.el7.x86_64
>>> > > > > libvirt-daemon-driver-nwfilter-4.5.0-33.el7.x86_64
>>> > > > > libvirt-daemon-driver-storage-scsi-4.5.0-33.el7.x86_64
>>> > > > > libvirt-daemon-driver-storage-rbd-4.5.0-33.el7.x86_64
>>> > > > > libvirt-daemon-config-nwfilter-4.5.0-33.el7.x86_64
>>> > > > > libvirt-daemon-driver-storage-disk-4.5.0-33.el7.x86_64
>>> > > > > libvirt-bash-completion-4.5.0-33.el7.x86_64
>>> > > > > libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
>>> > > > > libvirt-daemon-driver-storage-4.5.0-33.el7.x86_64
>>> > > > > libvirt-python-4.5.0-1.el7.x86_64
>>> > > > > libvirt-daemon-driver-interface-4.5.0-33.el7.x86_64
>>> > > > > libvirt-daemon-driver-storage-mpath-4.5.0-33.el7.x86_64
>>> > > > > [root at podiscsivc-kvm01 network-scripts]# rpm -qa|grep qemu
>>> > > > > qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64
>>> > > > > qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64
>>> > > > > libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
>>> > > > > centos-release-qemu-ev-1.0-4.el7.centos.noarch
>>> > > > > ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch
>>> > > > > qemu-img-ev-2.12.0-44.1.el7_8.1.x86_64
>>> > > > >
>>> > > > >
>>> > > > > As for the firewall driver, in
>>> > > > > /etc/neutron/plugins/ml2/openvswitch_agent.ini:
>>> > > > >
>>> > > > > firewall_driver = iptables_hybrid
>>> > > > >
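>>> > > > > (A quick way to double check which firewall driver each compute node is
>>> > > > > running, just as a sketch:
>>> > > > >
>>> > > > >     grep -A2 '^\[securitygroup\]' /etc/neutron/plugins/ml2/openvswitch_agent.ini
>>> > > > >
>>> > > > > which should report iptables_hybrid or openvswitch.)
>>> > > > >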
>>> > > > > I have the same libvirt/qemu version on the Queens, Rocky and Stein
>>> > > > > testing environments, and the
>>> > > > > same firewall driver.
>>> > > > > Live migration on a provider network on Queens works fine.
>>> > > > > It does not work fine on Rocky and Stein (the vm loses connectivity after
>>> > > > > it is migrated and starts to respond only when the vm itself sends a
>>> > > > > network packet, for example when chrony polls the time server).
>>> > > > >
>>> > > > > Ignazio
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > On Wed, Apr 29, 2020 at 14:36 Sean Mooney <smooney at redhat.com> wrote:
>>> > > > >
>>> > > > > > On Wed, 2020-04-29 at 10:39 +0200, Ignazio Cassano wrote:
>>> > > > > > > Hello, some updates about this issue.
>>> > > > > > > I read that someone has the same issue as reported here:
>>> > > > > > >
>>> > > > > > > https://bugs.launchpad.net/neutron/+bug/1866139
>>> > > > > > >
>>> > > > > > > If you read the discussion, someone says that the GARP must be sent
>>> > > > > > > by qemu during live migration.
>>> > > > > > > If this is true, it means that on Rocky/Stein qemu/libvirt are buggy.
>>> > > > > >
>>> > > > > > it is not correct.
>>> > > > > > qemu/libvirt has always used RARP, which predates GARP, as its mac
>>> > > > > > learning frames instead:
>>> > > > > > https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol
>>> > > > > > https://lists.gnu.org/archive/html/qemu-devel/2009-10/msg01457.html
>>> > > > > > however it looks like this was broken in 2016 in qemu 2.6.0
>>> > > > > > https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04645.html
>>> > > > > > but was fixed by
>>> > > > > > https://github.com/qemu/qemu/commit/ca1ee3d6b546e841a1b9db413eb8fa09f13a061b
>>> > > > > > can you confirm you are not using the broken 2.6.0 release and are
>>> > > > > > using 2.7 or newer, or 2.4 and older?
>>> > > > > >
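>>> > > > > > a quick way to check on each compute node (paths and package names as on
>>> > > > > > a centos/rhel deployment with the ev packages, just as an example) is:
>>> > > > > >
>>> > > > > >     rpm -q qemu-kvm-ev
>>> > > > > >     /usr/libexec/qemu-kvm --version
>>> > > > > >
>>> > > > > > anything reporting 2.6.0 would be in the broken range.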
>>> > > > > >
>>> > > > > > > So I tried to use Stein and Rocky with the same version of the
>>> > > > > > > libvirt/qemu packages I installed on Queens (I updated the compute and
>>> > > > > > > controller nodes on Queens to obtain the same libvirt/qemu version as
>>> > > > > > > deployed on Rocky and Stein).
>>> > > > > > >
>>> > > > > > > On Queens, live migration on a provider network continues to work fine.
>>> > > > > > > On Rocky and Stein it does not, so I think the issue is related to the
>>> > > > > > > OpenStack components.
>>> > > > > >
>>> > > > > > on queens we have only a single port binding and nova blindly assumes
>>> > > > > > that the port binding details won't
>>> > > > > > change when it does a live migration, so it does not update the xml for
>>> > > > > > the network interfaces.
>>> > > > > >
>>> > > > > > the port binding is updated after the migration is complete in
>>> > > > > > post_livemigration.
>>> > > > > > in rocky+ neutron optionally uses the multiple port bindings flow to
>>> > > > > > prebind the port to the destination
>>> > > > > > so it can update the xml if needed, and if post copy live migration is
>>> > > > > > enabled it will asynchronously activate the dest port
>>> > > > > > binding before post_livemigration, shortening the downtime.
>>> > > > > >
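>>> > > > > > (as an illustration of that flow only, not something you normally call
>>> > > > > > by hand, the extended port bindings API in neutron looks roughly like:
>>> > > > > >
>>> > > > > >     # create an inactive binding for the destination host before migration
>>> > > > > >     POST /v2.0/ports/{port_id}/bindings   {"binding": {"host": "dest-host"}}
>>> > > > > >     # activate it once the vm is running on the destination
>>> > > > > >     PUT  /v2.0/ports/{port_id}/bindings/dest-host/activate
>>> > > > > >
>>> > > > > > where dest-host is a placeholder for the destination hypervisor hostname.)
>>> > > > > >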
>>> > > > > > if you are using the iptables firewall, os-vif will have precreated
>>> > > > > > the ovs port and intermediate linux bridge before the
>>> > > > > > migration started, which allows neutron to wire it up (put it on the
>>> > > > > > correct vlan and install security groups) before
>>> > > > > > the vm completes the migration.
>>> > > > > >
>>> > > > > > if you are using the ovs firewall, os-vif still precreates the ovs
>>> > > > > > port but libvirt deletes it and recreates it.
>>> > > > > > as a result there is a race when using the openvswitch firewall that can
>>> > > > > > result in the RARP packets being lost.
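>>> > > > > > (one way to see whether the rarp frames actually arrive on the
>>> > > > > > destination host is to capture during the migration, for example:
>>> > > > > >
>>> > > > > >     tcpdump -eni <tap-or-physical-interface> rarp
>>> > > > > >
>>> > > > > > where the interface name is a placeholder for the vm's tap device or
>>> > > > > > the relevant uplink.)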
>>> > > > > >
>>> > > > > > >
>>> > > > > > > Best Regards
>>> > > > > > > Ignazio Cassano
>>> > > > > > >
>>> > > > > > >
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > On Mon, Apr 27, 2020 at 19:50 Sean Mooney <smooney at redhat.com> wrote:
>>> > > > > > >
>>> > > > > > > > On Mon, 2020-04-27 at 18:19 +0200, Ignazio Cassano wrote:
>>> > > > > > > > > Hello, I have this problem with Rocky or newer with the
>>> > > > > > > > > iptables_hybrid firewall.
>>> > > > > > > > > So, can I solve it by using post copy live migration?
>>> > > > > > > >
>>> > > > > > > > so this behavior has always been how nova worked but rocky
>>> the
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > >
>>> > > > > >
>>> > >
>>> > >
>>> https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neutron-new-port-binding-api.html
>>> > > > > > > > spec intoduced teh ablity to shorten the outage by pre
>>> biding the
>>> > > > > >
>>> > > > > > port and
>>> > > > > > > > activating it when
>>> > > > > > > > the vm is resumed on the destiation host before we get to
>>> pos
>>> > >
>>> > > live
>>> > > > > >
>>> > > > > > migrate.
>>> > > > > > > >
>>> > > > > > > > this reduces the outage time, although it can't be fully eliminated
>>> > > > > > > > as some level of packet loss is
>>> > > > > > > > always expected when you live migrate.
>>> > > > > > > >
>>> > > > > > > > so yes, enabling post copy live migration should help, but be aware
>>> > > > > > > > that if a network partition happens
>>> > > > > > > > during a post copy live migration the vm will crash and need to be
>>> > > > > > > > restarted.
>>> > > > > > > > it is generally safe to use and will improve the migration
>>> > > > > > > > performance, but unlike pre copy migration, if
>>> > > > > > > > the guest resumes on the dest and a memory page has not been copied
>>> > > > > > > > yet, then it must wait for it to be copied
>>> > > > > > > > and retrieved from the source host. if the connection to the source
>>> > > > > > > > host is interrupted then the vm can't
>>> > > > > > > > do that, and the migration will fail and the instance will crash. if
>>> > > > > > > > you are using precopy migration,
>>> > > > > > > > if there is a network partition during the migration the migration
>>> > > > > > > > will fail but the instance will continue
>>> > > > > > > > to run on the source host.
>>> > > > > > > >
>>> > > > > > > > so while i would still recommend using it, it is just good to be
>>> > > > > > > > aware of that behavior change.
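>>> > > > > > > > for completeness, post copy is enabled on the compute nodes via the
>>> > > > > > > > libvirt section of nova.conf (a minimal sketch; it also needs a
>>> > > > > > > > kernel and qemu with userfaultfd support):
>>> > > > > > > >
>>> > > > > > > >     [libvirt]
>>> > > > > > > >     live_migration_permit_post_copy = True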
>>> > > > > > > >
>>> > > > > > > > > Thanks
>>> > > > > > > > > Ignazio
>>> > > > > > > > >
>>> > > > > > > > > On Mon, Apr 27, 2020 at 17:57 Sean Mooney <smooney at redhat.com> wrote:
>>> > > > > > > > >
>>> > > > > > > > > > On Mon, 2020-04-27 at 17:06 +0200, Ignazio Cassano wrote:
>>> > > > > > > > > > > Hello, I have a problem on Stein neutron. When a vm migrates
>>> > > > > > > > > > > from one node to another I cannot ping it for several minutes.
>>> > > > > > > > > > > If I put a script in the vm that pings the gateway continuously,
>>> > > > > > > > > > > the live migration works fine and I can ping it. Why does this
>>> > > > > > > > > > > happen? I read something about gratuitous arp.
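>>> > > > > > > > > > > (the keepalive script is nothing special, just something like
>>> > > > > > > > > > > the following inside the guest, where <gateway-ip> is the
>>> > > > > > > > > > > subnet gateway:
>>> > > > > > > > > > >
>>> > > > > > > > > > >     ping -i 1 <gateway-ip> > /dev/null &
>>> > > > > > > > > > > )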
>>> > > > > > > > > >
>>> > > > > > > > > > qemu does not use gratuitous arp but instead uses an older
>>> > > > > > > > > > protocol called RARP
>>> > > > > > > > > > to do mac address learning.
>>> > > > > > > > > >
>>> > > > > > > > > > what release of openstack are you using, and are you using the
>>> > > > > > > > > > iptables firewall or the openvswitch firewall?
>>> > > > > > > > > >
>>> > > > > > > > > > if you are using openvswitch there is nothing we can do until
>>> > > > > > > > > > we finally delegate vif plugging to os-vif.
>>> > > > > > > > > > currently libvirt handles interface plugging for kernel ovs when
>>> > > > > > > > > > using the openvswitch firewall driver.
>>> > > > > > > > > > https://review.opendev.org/#/c/602432/ would address that, but
>>> > > > > > > > > > it and the neutron patch
>>> > > > > > > > > > (https://review.opendev.org/#/c/640258) are rather outdated.
>>> > > > > > > > > > while libvirt is plugging the vif there will always be
>>> > > > > > > > > > a race condition where the RARP (mac learning) packets sent by
>>> > > > > > > > > > qemu will be lost.
>>> > > > > > > > > >
>>> > > > > > > > > > if you are using the iptables firewall and you have openstack
>>> > > > > > > > > > rocky or later, then if you enable post copy live migration
>>> > > > > > > > > > it should reduce the downtime. in this configuration we do not
>>> > > > > > > > > > have the race between neutron and libvirt, so the rarp
>>> > > > > > > > > > packets should not be lost.
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > > > Please, can you help me?
>>> > > > > > > > > > > Any workaround, please?
>>> > > > > > > > > > >
>>> > > > > > > > > > > Best Regards
>>> > > > > > > > > > > Ignazio
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > >
>>> > > > > >
>>> > >
>>> > >
>>>
>>>

