Re: [rhos-dev] [Neutron] PTG summary

4 Nov 2020

      Hi,

Dnia środa, 4 listopada 2020 07:13:40 CET Dan Sneddon pisze:
> On 11/2/20 1:56 PM, Slawek Kaplonski wrote:
> > Hi,
> > 
> > Below is my summary of the Neutron team sessions which we had during the
> > virtual PTG last week.
> > Etherpad with notes from the discussions can be found at [1].
> > 
> > ## Retrospective of the Victoria cycle
> > 
> >>From the good things during _Victoria_ cycle team pointed:
> > * Complete 8 blueprints including the _Metadata over IPv6_ [2],
> > * Improved feature parity in the OVN driver,
> > * Good review velocity
> > 
> >>From the not so good things we mentioned:
> > * CI instability - average number of rechecks needed to merge patch in
> > last
> > year can be found at [3],
> > * Too much "Red Hat" in the Neutron team - and percentage of the reviews
> > (and patches) done by people from Red Hat is constantly increasing over
> > last few cycles. As a main reason of that we pointed that it is hard to
> > change companies way of thinking that dedicating developer to the
> > upstream project means that they lose developer in downstream.
> > * Migration of our CI and Zuul v3 took us a lot of time and effort - but
> > good thing is that we accomplished that in the Victoria cycle :)
> > 
> > During that session we also agreed on some action items for the Wallaby
> > cycle: * keep talking about OVN details - Miguel Lavalle and me will work
> > on some way to deliver talks about Neutron OVN backend and OVN internals
> > to the community during next months. The idea is to maybe propose some
> > talks ever 6 weeks or so. This may make ovn driver development more
> > diverse and let operators to thing about migration to the ovn backend.
> > 
> > ## Review of the existing meetings
> > We reviewed list of our existing upstream meetings and discussed about
> > ideas on how to increase number of attendees on the meetings.
> > We decided to:
> > * drop neutron-qos meeting as it's not needed anymore
> > * advertise more meetings and meetings' agenda on the OpenStack mailing
> > list - I will send reminders with links to the agenda before every
> > meeting * Together with Lajos Katona we will give some introduction to
> > the debugging of CI issues in Neutron*
> > 
> > ## Support for old versions
> > Bernard started discussion about support for the old releases in Neutron
> > and Neutron stadium projects.
> > For Neutron we decided to mark __Ocata__ branch as unmaintained already as
> > its gate is already broken.
> > For the __Pike__ and never branches we will keep them in the __EM__ phase
> > as there is still some community interest to keep those branches open.
> > For the stadium projects we decided to do it similary to what we did
> > while looking for new maintainers for the projects. We will send email
> > "call for maintainers" for such stable branches. If there will be no
> > voluneers to step in, fix gate issues and keep those branches healthy, we
> > will mark them as __unmaintained__ and later as __End of Life__ (EOL).
> > Currently broken CI is in projects:
> > * networking-sfc,
> > * networking-bgpvpn/bagpipe,
> > * neutron-fwaas
> > 
> > And those are candidates to be marked as unmaintained if there will be no
> > volunteers to fix them.
> > Bernard Cafarelli volunteered to work on that in next days/weeks.
> > 
> > 
> > ## Healtcheck API endpoint
> > We discussed as our healtcheck API should works. During the discussion we
> > decided that:
> > * healtcheck result should __NOT__ rely on the agents status, it should
> > rely on worker's ability to connect to the DB and MQ (rabbitmq)
> > * Lajos will ask community (API experts) about some guidance how it should
> > works on the whole OpenStack level,
> > * As for reference implementation we can check e.g. Octavia [4] and
> > Keystone [5] which already implemented it.
> > 
> > ## Squash of the DB migration script
> > Rodolfo explained us what are benefits of doing such squash of the db
> > migration scripts from the old versions:
> > * Deployment is faster: we don't need to create/delete tables or
> > create+update other ones - the win is small possibly in the magnitude of
> > 5s per job, * DB definition is centralized in one place, not in original
> > definition plus further migrations - that is most important reason why we
> > really should do that,
> > * UTs faster: removal of some older checks.
> > 
> > The problem with this may be that we need to do that carefully and be
> > really verbose about with such changes we may break stadium projects or
> > 3rd party projects which are doing db migration too.
> > To minimalize potential breakage, we will announce such changes on the
> > OpenStack discuss mailing list.
> > Rodolfo volunteered to take propose squash up to Liberty release in W
> > cycle. Together with this squash we will also document that process so in
> > next cycles we should be able to do squashes for newer releases in easier
> > way. Lajos volunteered to help with fixing Neutron stadium projects if
> > that will be needed.
> > 
> > ## Switch to the new engine facade
> > We were discussing how to move on and finally finish old Blueprint [6]. We
> > decided that together with Rodolfo we will try how this new engine facade
> > will work without using transaction guards in the code. Hopefully that
> > will let us move on with this. If not, we will try to reach out to some
> > DB experts for some help with this.
> > 
> > ## Change from rootwrap to the privsep
> > This is now community goal during the Wallaby cycle so we need to focus on
> > it to accomplish that transition finally.
> > This transition may speed up and make our code a bit more secure.
> > Rodolfo explained us multiple possible strategies of migration:
> > * move to native, e.g.
> > 
> >    * replace ps with python psutils, not using rootwrap or privsep
> >    * replace ip commands with pyroute2, under a privsep context (elevated
> > 
> > permissions needed)
> > * directly translate rootwrap to privsep, executing the same shell command
> > but under a privsep context
> > 
> > To move on with this I will create list of the pieces of code which needs
> > to be transitioned in the Neutron repo and in the stadium projects.
> > Current todo items can be found on the storyboard [7].
> > 
> > ## Migration to the NFtables
> > During this session we were discussing potential strategies on how to
> > migrate from the old iptables to the new nftables. We need to start
> > planning that work as it major Linux distributions (e.g. RHEL) are
> > planning to deprecate iptables in next releases.
> > It seems that currently all major distros (Ubuntu, Centos, OpenSuSE)
> > supports nftables already.
> > We decided that in Wallaby cycle we will propose new _Manager_ class and
> > we
> > will add some config option which will allow people to test new solution.
> > In next cycles we will continue work on it to make it stable and to make
> > upgrade and migration path for users as easy as possible.
> > There is already created blueprint to track progress on that topic [8].
> > We need to migrate:
> > * Linuxbridge firewall, iptables OVS hybrid firewall,
> > * L3 code (legacy router, DVR router, conntrack, port forwarding),
> > * iptables metering,
> > * metadata proxy,
> > * dhcp agent for when it does metadata for isolated networks and namespace
> > creation,
> > * neutron-vpnaas - ipsec code,
> > * and maybe something else what we didn't found yet.
> > 
> > ## Nova-Neutron cross project session
> > We had very interesting discussion with Nova team. We were discussing
> > topics like:
> > * NUMA affinity in the neutron port
> > * vhost-vdpa support
> > * default __vnic_type__/__port flavour__
> > 
> > Notes from that discussion are available in the nova's etherpad [9].
> > 
> > ## Neutron scalling issues
> > At this session we were discussing issues mentioned by operators on the
> > Forum sessions a week before the PTG. There was couple of issues
> > mentioned there: * problems with retries of the DB operations - we should
> > migrate all our code to the oslo.db retries mechanism - new blueprint
> > [10] is created to track progress on that one.
> > * problems with maintenance of the agents, like e.g. DHCP or L3 agents -
> > many of those issues are caused by how our agents are designed and to
> > really fix that we would need very deep and huge changes. But also many
> > of those issues can be solved by the __ovn__ backend - **and that is
> > strategic direction in which neutron wants to go in the next cycles**,
> > * Miguel Lavalle and I volunteered to do some profiling of the agents to
> > see where we are loosing most of the time - maybe we will be able to find
> > some _low hanging fruits_ which can be fixed and improve the situation at
> > least a bit, * Similar problem with neutron-ovs-agent and especially
> > security groups which are using _remove group id_ as a reference - here
> > we also need some volunteers who will try to optimize that.
> > 
> > ## CI (in)stablility
> > On Thursday we were discussing how to improve our very poor CI. Finally we
> > decided to:
> > * not recheck patches without giving reason of recheck in the comment -
> > there should be already reported bug which should be linked in the
> > _recheck_ comment, or user should open new one and link to it also. IN
> > case if the problem was e.g. related to infra some simple comment like
> > _infra issue_ will also be enough there,
> > 
> > * To lower number of existing jobs we will do some changes like:
> >    * move *-neutron-lib-master and *-ovs-master jobs to the experimental
> >    and
> > 
> > periodic queues to not run them on every patch,
> > 
> >    * I will switch _neutron-tempest-plugin-api_ job to be deployed with
> >    uwsgi
> > 
> > so we can drop _neutron-tempest-with-uwsgi_ job,
> > 
> >    * Consolidate _neutron-tempest-plugin-scenario-linuxbridge_ and
> >    _neutron-
> > 
> > tempest-linuxbridge_ jobs,
> > 
> >    * Consolidate _neutron-tempest-plugin-scenario-iptables_hybrid and
> >    _neutron-> 
> > tempest-iptables_hybrid jobs,
> > 
> > Later we also discussed about the way how to run or skip tests which can
> > be
> > only run when some specific feature is available in the cloud (e.g.
> > _Metadata over IPv6_). After some discussion we decided to add new config
> > option with list of enabled features. It will be very similar to the
> > existing option _api_extensions_. Lajos volunteered to work on that.
> > 
> > As last CI related topic we discussed about testing DVR in our CI. Oleg
> > Bondarev volunteered to check and try to fix broken
> > _neutron-tempest-plugin- dvr-multinode-scenario_ job.
> > 
> > ## Flow based DHCP
> > 
> > Liu Yulong raised topic about new way of doing fully distributed DHCP
> > service, instead of using _DHCP agent_ on the nodes - RFE is proposed at
> > [11]. His proposal of doing Open Flow based DHCP (similar to what e.g.
> > ovn-controller is doing) is described in [12]. It could be implemented as
> > an L2 agent extension and enabled by operators in the config when they
> > would need it.
> > As a next step Liu will now propose spec with details about this solution
> > and we will continue discussion about it in the spec's review.
> 
> When retiring the DHCP agent was discussed in Shanghai it was assumed
> that the flow-based DHCP server would not be compatible with Ironic.
> Currently the OVN native implementation is not compatible and DHCP agent
> is required, but OVN is planning on implementing support for native DHCP
> for Ironic soon (IIUC).
> 
> Was there any discussion about what it might take to extend the
> flow-based DHCP server to support direct connection to VLAN/flat
> networks and the DHCP options required for PXE/iPXE for Ironic? Is that
> a possibility in the future, or would we need to continue to maintain
> the DHCP agent even if OVN no longer requires it?

For now we didn't discuss that. And we also don't have plans to drop support 
for DHCP agent.
This new proposal is for sure not good for every usecase and will be (at least 
for now) proposed as an alternative solution which can be used if features 
like e.g. dns name resolutions are not needed.
In the future we can of course thing about extending this new solution to 
something more complete. 

> 
> > ## Routed provider networks limited to one host
> > 
> > As a lost topic on Thursday we briefly talked about old RFE [13]. Miguel
> > Lavalle told us that his company, Verizon Media, is interested in working
> > on this RFE in next cycles. This also involves some work on Nova's side
> > which was started by Sylvain Bauza already. Miguel will sync with Sylvain
> > on that RFE.
> > 
> > ## L3 feature improvements
> > 
> > On Friday we were discussing some potential improvements in the L3 area.
> > Lajos and Bence shown us some features which their company is interested
> > in and on which they plan to work. Those are things like:
> > * support for Bidirectional Forwarding Detection
> > 
> > * some additional API to set additional router parameters like:
> >    * ECMP max path,
> >    * ECMP hash algorith
> > 
> > * --provider-allocation-pool parameter in the subnets - in some specific
> > cases it may help to use IPs from such _special_ pool for some
> > infrastructure needs, more details about that will come in the RFE in
> > future,
> > For now all those described above improvements are in very early planning
> > phase but Bence will sync with Liu and Liu will dedicate some time to
> > discuss progress on them during the __L3 subteam meetings__.
> 
> I submitted a spec for installing FRRouting (FRR) via TripleO:
> 
> https://review.opendev.org/#/c/758249/
> 
> This could be used for ECMP, as well as for routing traffic to the
> HAProxy load balancers fronting the control plane, and advertising
> routes to Neutron IPs on dynamically routed networks (VM IPs and/or
> floating IPs).
> 
> The goal is to have a very simple implementation where IP addresses
> would be added to a default or alternate namespace (depending on the use
> case) as loopback addresses with a /32 (v4) or /128 (v6) CIDR. In the
> case of FRR the Zebra daemon receives updates via Netlink when these IP
> addresses are created locally and redistributes them to BGP peers. In
> theory this may allow a different BGP daemon such as Bird or perhaps
> ExaBGP to be easily swapped for FRR.

Thx for the info on that. I sent it to Miguel Lavalle and Jan Gutter who were 
mostly interested in that work in upstream.

> 
> I will look forward to seeing more on the --provider-allocation-pool
> parameter.
> 
> > ## Leveraging routing-on-the-host in Neutron in our next-gen clusters
> > 
> > As a last topic on Friday we were discussing potential solutions of the
> > _L3 on the host_ in the Neutron. The idea here is very similar to what
> > e.g. __Calico plugin__ is doing currently.
> > More details about potential solutions are described in the etherpad [14].
> > During the discussion Dawid Deja from OVH told us that OVH is also using
> > very similar, downstream only solution.
> > Conclusion of that discussion was that we may have most of the needed code
> > already in Neutron and some stadium projects so as a first step people who
> > are interested in that topic, like Jan Gutter, Miguel and Dawid will work
> > on some deployment guide for such use case.
> 
> Is there any public info on the OVH approach available?

I don't think so but I can try to explain it to You more or less if You want 
as I was original co-author of that solution many years ago. Please ping me off 
the list if You are interested :)

> 
> > ## Team photo
> > During the PTG we also made team photos which You can find at [15].
> > 
> > [1] https://etherpad.opendev.org/p/neutron-wallaby-ptg
> > [2] https://blueprints.launchpad.net/neutron/+spec/metadata-over-ipv6
> > [3] https://ibb.co/12sB9N9
> > [4] https://opendev.org/openstack/octavia/src/branch/master/octavia/api/
> > healthcheck/healthcheck_plugins.py
> > [5]
> > https://docs.openstack.org/keystone/victoria/admin/health-check-middlewar
> > e.html [6]
> > https://blueprints.launchpad.net/neutron/+spec/enginefacade-switch [7]
> > https://storyboard.openstack.org/#!/story/2007686
> > [8] https://blueprints.launchpad.net/neutron/+spec/nftables-migration
> > [9] https://etherpad.opendev.org/p/nova-wallaby-ptg
> > [10] https://blueprints.launchpad.net/neutron/+spec/oslo-db-retries
> > [11] https://bugs.launchpad.net/neutron/+bug/1900934
> > [12] https://github.com/gotostack/shanghai_ptg/blob/master/
> > shanghai_neutron_ptg_topics_liuyulong.pdf
> > [13] https://bugs.launchpad.net/neutron/+bug/1764738
> > [14] https://etherpad.opendev.org/p/neutron-routing-on-the-host
> > [15] http://kaplonski.pl/files/Neutron_virtual_PTG_October_2020.tar.gz
> 
> --
> Dan Sneddon         |  Senior Principal Software Engineer
> dsneddon@redhat.com |  redhat.com/cloud
> dsneddon:irc        |  @dxs:twitter

-- 
Slawek Kaplonski
Principal Software Engineer
Red Hat