[stein][neutron] gratuitous arp

Ignazio Cassano ignaziocassano at gmail.com
Fri May 1 16:55:55 UTC 2020


Thanks, have a nice long weekend
Ignazio

On Fri, 1 May 2020, 18:47 Sean Mooney <smooney at redhat.com> wrote:

> On Fri, 2020-05-01 at 18:34 +0200, Ignazio Cassano wrote:
> > Hello Sean,
> > to be honest I did not understand the difference between the first
> > and second patch, but that is due to my poor skill and my poor English.
> No worries. The first patch is the actual change that adds the new config
> option.
> The second patch is just a change to force our CI jobs to enable the
> config option. We probably don't want to do that permanently, which is why
> I have marked it [DNM] or "do not merge"; it is just there to prove that
> the first patch is correct.
>
> > In any case I would like to test it. I saw I can download the files
> > workaround.py and neutron.py, and that there is a new option,
> > force_legacy_port_binding.
> > How can I test it?
> > Must I enable the new option under the workaround section in the
> > nova.conf on the compute nodes, setting it to true?
> Yes, that is correct: if you apply the first patch you need to set the new
> config option in the workarounds section of the nova.conf on the
> controller. Specifically the conductor needs to have this set. I don't
> think this is needed on the compute nodes; at least it should not need to
> be set in the compute node nova.conf for the live migration issue.
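>
> For reference, a minimal sketch of what that would look like in the
> conductor's nova.conf (option name taken from the patch under review, so
> treat it as illustrative until it merges):
>
>     [workarounds]
>     # opt in to the pre-Rocky single port binding behaviour
>     # during live migration
>     force_legacy_port_binding = True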
>
> > Must the downloaded files (from the first or the second patch?) be
> > copied on the compute nodes under /usr/lib/python2.7/site-packages as
> > nova/conf/workaround.py and nova/network/neutron.py, and then the nova
> > compute service restarted?
>
> Once we have merged this in master I'll backport it to the different
> openstack versions back to rocky.
> If you want to test it before then, the simplest thing to do is just
> manually make the same change, unless you are using devstack in which
> case you could cherry-pick the change to whatever branch you are testing.
>
> > Should it work only for new instances or also for running instances?
>
> It will apply to all instances. What the change does is disable our
> detection of neutron support for the multiple port binding workflow. We
> still have compatibility code for supporting old versions of neutron. We
> probably should remove that at some point, but when the config option is
> set we will ignore whether you are using an old or new neutron and just
> fall back to how we did things before rocky.
>
> In principle that should make live migration have more packet loss, but
> since people have reported that it actually fixes the issue in this case,
> I have written the patch so you can opt in to the old behaviour.
>
> If that works for you in your testing, we can continue to keep the
> workaround and the old compatibility code until we resolve the issue
> with the multiple port binding flow.
> > Sorry for disturbing you.
>
> Don't be sorry, it's fine to ask questions. Just be aware it's a long
> weekend, so I will not be working Monday, but I should be back on Tuesday.
> I'll update the patch then with a release note and a unit test, and
> hopefully I can get some cores to review it.
> > Best Regards
> > Ignazio
> >
> >
> > On Wed, 29 Apr 2020, 19:49 Sean Mooney <smooney at redhat.com> wrote:
> >
> > > On Wed, 2020-04-29 at 17:10 +0200, Ignazio Cassano wrote:
> > > > Many thanks.
> > > > Please keep in touch.
> > >
> > > Here are the two patches.
> > > The first, https://review.opendev.org/#/c/724386/, is the actual
> > > change to add the new config option. It needs a release note and some
> > > tests, but it should be functional, hence the [WIP].
> > > I have not enabled the workaround in any job in this patch, so the CI
> > > run will assert that it does not break anything in the default case.
> > >
> > > The second patch is https://review.opendev.org/#/c/724387/, which
> > > enables the workaround in the multi-node CI jobs and tests that live
> > > migration still works when the workaround is enabled.
> > >
> > > This should work, as it is what we expect to happen if you are using a
> > > modern nova with an old neutron.
> > > It is marked [DNM] as I don't intend that patch to merge, but if the
> > > workaround is useful we might consider enabling it for one of the jobs
> > > to get CI coverage, though not for all of the jobs.
> > >
> > > I have not had time to deploy a 2-node env today, but I'll try to test
> > > this locally tomorrow.
> > >
> > >
> > >
> > > > Ignazio
> > > >
> > > > On Wed, 29 Apr 2020 at 16:55, Sean Mooney <smooney at redhat.com>
> > > > wrote:
> > > >
> > > > > So, being pragmatic, I think the simplest path forward, given that
> > > > > my other patches have not landed in almost 2 years, is to quickly
> > > > > add a workaround config option to disable multiple port binding,
> > > > > which we can backport, and then we can try to work on the actual
> > > > > fix after. According to
> > > > > https://bugs.launchpad.net/neutron/+bug/1815989 that should serve
> > > > > as a workaround for those that have this issue, but it's a
> > > > > regression in functionality.
> > > > >
> > > > > I can create a patch that will do that in an hour or so and submit
> > > > > a follow-up DNM patch to enable the workaround in one of the gate
> > > > > jobs that tests live migration.
> > > > > I have a meeting in 10 mins and need to finish the patch I'm
> > > > > currently updating, but I'll submit a PoC once that is done.
> > > > >
> > > > > I'm not sure if I will be able to spend time on the actual fix
> > > > > which I proposed last year, but I'll see what I can do.
> > > > >
> > > > >
> > > > > On Wed, 2020-04-29 at 16:37 +0200, Ignazio Cassano wrote:
> > > > > > PS
> > > > > > I have testing environments on queens, rocky and stein and I can
> > > > > > run whatever tests you need.
> > > > > > Ignazio
> > > > > >
> > > > > > On Wed, 29 Apr 2020 at 16:19, Ignazio Cassano <
> > > > > > ignaziocassano at gmail.com> wrote:
> > > > > >
> > > > > > > Hello Sean,
> > > > > > > the following is the configuration on my compute nodes:
> > > > > > > [root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep libvirt
> > > > > > > libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7.x86_64
> > > > > > > libvirt-daemon-kvm-4.5.0-33.el7.x86_64
> > > > > > > libvirt-libs-4.5.0-33.el7.x86_64
> > > > > > > libvirt-daemon-driver-network-4.5.0-33.el7.x86_64
> > > > > > > libvirt-daemon-driver-nodedev-4.5.0-33.el7.x86_64
> > > > > > > libvirt-daemon-driver-storage-gluster-4.5.0-33.el7.x86_64
> > > > > > > libvirt-client-4.5.0-33.el7.x86_64
> > > > > > > libvirt-daemon-driver-storage-core-4.5.0-33.el7.x86_64
> > > > > > > libvirt-daemon-driver-storage-logical-4.5.0-33.el7.x86_64
> > > > > > > libvirt-daemon-driver-secret-4.5.0-33.el7.x86_64
> > > > > > > libvirt-daemon-4.5.0-33.el7.x86_64
> > > > > > > libvirt-daemon-driver-nwfilter-4.5.0-33.el7.x86_64
> > > > > > > libvirt-daemon-driver-storage-scsi-4.5.0-33.el7.x86_64
> > > > > > > libvirt-daemon-driver-storage-rbd-4.5.0-33.el7.x86_64
> > > > > > > libvirt-daemon-config-nwfilter-4.5.0-33.el7.x86_64
> > > > > > > libvirt-daemon-driver-storage-disk-4.5.0-33.el7.x86_64
> > > > > > > libvirt-bash-completion-4.5.0-33.el7.x86_64
> > > > > > > libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
> > > > > > > libvirt-daemon-driver-storage-4.5.0-33.el7.x86_64
> > > > > > > libvirt-python-4.5.0-1.el7.x86_64
> > > > > > > libvirt-daemon-driver-interface-4.5.0-33.el7.x86_64
> > > > > > > libvirt-daemon-driver-storage-mpath-4.5.0-33.el7.x86_64
> > > > > > > [root@podiscsivc-kvm01 network-scripts]# rpm -qa|grep qemu
> > > > > > > qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64
> > > > > > > qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64
> > > > > > > libvirt-daemon-driver-qemu-4.5.0-33.el7.x86_64
> > > > > > > centos-release-qemu-ev-1.0-4.el7.centos.noarch
> > > > > > > ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch
> > > > > > > qemu-img-ev-2.12.0-44.1.el7_8.1.x86_64
> > > > > > >
> > > > > > >
> > > > > > > As for the firewall driver, in
> > > > > > > /etc/neutron/plugins/ml2/openvswitch_agent.ini:
> > > > > > >
> > > > > > > firewall_driver = iptables_hybrid
> > > > > > >
> > > > > > > I have the same libvirt/qemu version on the queens, rocky and
> > > > > > > stein testing environments, and the same firewall driver.
> > > > > > > Live migration on a provider network on queens works fine.
> > > > > > > It does not work fine on rocky and stein (the vm loses
> > > > > > > connectivity after it is migrated and starts to respond only
> > > > > > > when the vm sends a network packet, for example when chrony
> > > > > > > polls the time server).
> > > > > > >
> > > > > > > Ignazio
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, 29 Apr 2020 at 14:36, Sean Mooney <
> > > > > > > smooney at redhat.com> wrote:
> > > > > > >
> > > > > > > > On Wed, 2020-04-29 at 10:39 +0200, Ignazio Cassano wrote:
> > > > > > > > > Hello, some updates about this issue.
> > > > > > > > > I read that someone has got the same issue, as reported here:
> > > > > > > > >
> > > > > > > > > https://bugs.launchpad.net/neutron/+bug/1866139
> > > > > > > > >
> > > > > > > > > If you read the discussion, someone says that the GARP must
> > > > > > > > > be sent by qemu during live migration.
> > > > > > > > > If this is true, it means that on rocky/stein qemu/libvirt
> > > > > > > > > are bugged.
> > > > > > > >
> > > > > > > > That is not correct.
> > > > > > > > qemu/libvirt has always used RARP, which predates GARP, as its
> > > > > > > > mac learning frames instead:
> > > > > > > > https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol
> > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2009-10/msg01457.html
> > > > > > > > However it looks like this was broken in 2016 in qemu 2.6.0
> > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04645.html
> > > > > > > > but was fixed by
> > > > > > > > https://github.com/qemu/qemu/commit/ca1ee3d6b546e841a1b9db413eb8fa09f13a061b
> > > > > > > > Can you confirm you are not using the broken 2.6.0 release and
> > > > > > > > are using 2.7 or newer, or 2.4 and older?
> > > > > using
> > > > > > > > 2.7 or newer or 2.4 and older.
> > > > > > > >
> > > > > > > >
> > > > > > > > > So I tried to use stein and rocky with the same version of
> > > > > > > > > the libvirt/qemu packages I installed on queens (I updated
> > > > > > > > > the compute and controller nodes on queens to obtain the
> > > > > > > > > same libvirt/qemu version deployed on rocky and stein).
> > > > > > > > >
> > > > > > > > > On queens, live migration on a provider network continues to
> > > > > > > > > work fine.
> > > > > > > > > On rocky and stein it does not, so I think the issue is
> > > > > > > > > related to the openstack components.
> > > > > > > >
> > > > > > > > On queens we have only a single port binding, and nova blindly
> > > > > > > > assumes that the port binding details won't change when it does
> > > > > > > > a live migration, so it does not update the xml for the network
> > > > > > > > interfaces.
> > > > > > > >
> > > > > > > > The port binding is updated after the migration is complete in
> > > > > > > > post_livemigration.
> > > > > > > > In rocky+, neutron optionally uses the multiple port bindings
> > > > > > > > flow to pre-bind the port to the destination so nova can update
> > > > > > > > the xml if needed, and if post copy live migration is enabled
> > > > > > > > it will asynchronously activate the dest port binding before
> > > > > > > > post_livemigration, shortening the downtime.
> > > > > > > >
> > > > > > > > If you are using the iptables firewall, os-vif will have
> > > > > > > > pre-created the ovs port and intermediate linux bridge before
> > > > > > > > the migration starts, which allows neutron to wire it up (put
> > > > > > > > it on the correct vlan and install security groups) before the
> > > > > > > > vm completes the migration.
> > > > > > > >
> > > > > > > > If you are using the ovs firewall, os-vif still pre-creates the
> > > > > > > > ovs port, but libvirt deletes it and recreates it.
> > > > > > > > As a result there is a race when using the openvswitch firewall
> > > > > > > > that can result in the RARP packets being lost.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Best Regards
> > > > > > > > > Ignazio Cassano
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, 27 Apr 2020 at 19:50, Sean Mooney <
> > > > > > > > > smooney at redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > > On Mon, 2020-04-27 at 18:19 +0200, Ignazio Cassano wrote:
> > > > > > > > > > > Hello, I have this problem with rocky or newer with the
> > > > > > > > > > > iptables_hybrid firewall.
> > > > > > > > > > > So, can I solve it by using post copy live migration?
> > > > > > > > > >
> > > > > > > > > > So this behavior has always been how nova worked, but in
> > > > > > > > > > rocky the
> > > > > > > > > > https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neutron-new-port-binding-api.html
> > > > > > > > > > spec introduced the ability to shorten the outage by
> > > > > > > > > > pre-binding the port and activating it when the vm is
> > > > > > > > > > resumed on the destination host, before we get to post live
> > > > > > > > > > migrate.
> > > > > > > > > >
> > > > > > > > > > This reduces the outage time, although it cannot be fully
> > > > > > > > > > eliminated, as some level of packet loss is always expected
> > > > > > > > > > when you live migrate.
> > > > > > > > > >
> > > > > > > > > > So yes, enabling post copy live migration should help, but
> > > > > > > > > > be aware that if a network partition happens during a post
> > > > > > > > > > copy live migration the vm will crash and need to be
> > > > > > > > > > restarted.
> > > > > > > > > > It is generally safe to use and will improve the migration
> > > > > > > > > > performance, but unlike pre copy migration, if the guest
> > > > > > > > > > resumes on the dest and a memory page has not been copied
> > > > > > > > > > yet, then it must wait for it to be copied and retrieve it
> > > > > > > > > > from the source host. If the connection to the source host
> > > > > > > > > > is interrupted then the vm can't do that, and the migration
> > > > > > > > > > will fail and the instance will crash. If you are using
> > > > > > > > > > pre copy migration and there is a network partition during
> > > > > > > > > > the migration, the migration will fail but the instance
> > > > > > > > > > will continue to run on the source host.
> > > > > > > > > >
> > > > > > > > > > So while I would still recommend using it, it is just good
> > > > > > > > > > to be aware of that behavior change.
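> > > > > > > > > >
> > > > > > > > > > As a minimal sketch, post copy is opted into in the compute
> > > > > > > > > > nodes' nova.conf (assuming the usual [libvirt] option name;
> > > > > > > > > > check the docs for your release):
> > > > > > > > > >
> > > > > > > > > >     [libvirt]
> > > > > > > > > >     # allow the live migration to switch to post-copy mode
> > > > > > > > > >     live_migration_permit_post_copy = True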
> > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > > Ignazio
> > > > > > > > > > >
> > > > > > > > > > > On Mon, 27 Apr 2020, 17:57 Sean Mooney <
> > > > > > > > > > > smooney at redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Mon, 2020-04-27 at 17:06 +0200, Ignazio Cassano wrote:
> > > > > > > > > > > > > Hello, I have a problem on stein neutron. When a vm
> > > > > > > > > > > > > migrates from one node to another I cannot ping it for
> > > > > > > > > > > > > several minutes. If in the vm I put a script that
> > > > > > > > > > > > > pings the gateway continuously, the live migration
> > > > > > > > > > > > > works fine and I can ping it. Why does this happen? I
> > > > > > > > > > > > > read something about gratuitous arp.
> > > > > > > > > > > >
> > > > > > > > > > > > qemu does not use gratuitous arp, but instead uses an
> > > > > > > > > > > > older protocol called RARP to do mac address learning.
> > > > > > > > > > > >
> > > > > > > > > > > > What release of openstack are you using, and are you
> > > > > > > > > > > > using the iptables firewall or the openvswitch firewall?
> > > > > > > > > > > >
> > > > > > > > > > > > If you are using openvswitch there is nothing we can do
> > > > > > > > > > > > until we finally delegate vif plugging to os-vif.
> > > > > > > > > > > > Currently libvirt handles interface plugging for kernel
> > > > > > > > > > > > ovs when using the openvswitch firewall driver.
> > > > > > > > > > > > https://review.opendev.org/#/c/602432/ would address
> > > > > > > > > > > > that, but it and the neutron patch
> > > > > > > > > > > > https://review.opendev.org/#/c/640258 are rather
> > > > > > > > > > > > outdated. While libvirt is plugging the vif there will
> > > > > > > > > > > > always be a race condition where the RARP packets sent
> > > > > > > > > > > > by qemu, and then the mac learning packets, will be lost.
> > > > > > > > > > > >
> > > > > > > > > > > > If you are using the iptables firewall and you have
> > > > > > > > > > > > openstack rocky or later, then if you enable post copy
> > > > > > > > > > > > live migration it should reduce the downtime. In this
> > > > > > > > > > > > configuration we do not have the race between neutron
> > > > > > > > > > > > and libvirt, so the rarp packets should not be lost.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > Please, can you help me?
> > > > > > > > > > > > > Any workaround, please?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best Regards
> > > > > > > > > > > > > Ignazio
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > >
> > > > >
> > >
> > >
>
>

