[rhos-dev] [Neutron] PTG summary

Dan Sneddon dsneddon at redhat.com
Wed Nov 4 06:13:40 UTC 2020



On 11/2/20 1:56 PM, Slawek Kaplonski wrote:
> Hi,
> 
> Below is my summary of the Neutron team sessions which we had during the
> virtual PTG last week.
> Etherpad with notes from the discussions can be found at [1].
> 
> ## Retrospective of the Victoria cycle
> Among the good things in the _Victoria_ cycle, the team pointed out:
> * Completed 8 blueprints, including _Metadata over IPv6_ [2],
> * Improved feature parity in the OVN driver,
> * Good review velocity
> 
> Among the not-so-good things, we mentioned:
> * CI instability - the average number of rechecks needed to merge a patch
> over the last year can be found at [3],
> * Too much "Red Hat" in the Neutron team - the percentage of reviews (and
> patches) done by people from Red Hat has been increasing steadily over the
> last few cycles. The main reason we identified is that it is hard to change
> the way companies think: that dedicating a developer to the upstream project
> means losing that developer downstream.
> * The migration of our CI to Zuul v3 took us a lot of time and effort - but
> the good thing is that we accomplished it in the Victoria cycle :)
> 
> During that session we also agreed on some action items for the Wallaby cycle:
> * keep talking about OVN details - Miguel Lavalle and I will work on a way
> to deliver talks about the Neutron OVN backend and OVN internals to the
> community over the next months. The idea is to propose a talk every 6 weeks
> or so. This may make OVN driver development more diverse and let operators
> think about migrating to the OVN backend.
> 
> ## Review of the existing meetings
> We reviewed the list of our existing upstream meetings and discussed ideas
> on how to increase the number of attendees at the meetings.
> We decided to:
> * drop the neutron-qos meeting as it's not needed anymore,
> * advertise meetings and their agendas more on the OpenStack mailing list -
> I will send reminders with links to the agenda before every meeting,
> * together with Lajos Katona, I will give an introduction to debugging
> CI issues in Neutron.
> 
> ## Support for old versions
> Bernard started a discussion about support for the old releases in Neutron
> and the Neutron stadium projects.
> For Neutron we decided to mark the __Ocata__ branch as unmaintained now, as
> its gate is already broken.
> We will keep the __Pike__ and newer branches in the __EM__ phase as
> there is still some community interest in keeping those branches open.
> For the stadium projects we decided to proceed similarly to what we did when
> looking for new maintainers for the projects. We will send a "call for
> maintainers" email for such stable branches. If no volunteers step in to
> fix gate issues and keep those branches healthy, we will mark them as
> __unmaintained__ and later as __End of Life__ (EOL).
> CI is currently broken in these projects:
> * networking-sfc,
> * networking-bgpvpn/bagpipe,
> * neutron-fwaas
> 
> Those are candidates to be marked as unmaintained if no volunteers step in
> to fix them.
> Bernard Cafarelli volunteered to work on that in the next days/weeks.
> 
> 
> ## Healthcheck API endpoint
> We discussed how our healthcheck API should work. During the discussion we
> decided that:
> * the healthcheck result should __NOT__ rely on the agents' status; it should
> rely on the worker's ability to connect to the DB and MQ (rabbitmq),
> * Lajos will ask the community (API experts) for guidance on how it should
> work at the whole-OpenStack level,
> * As reference implementations we can check e.g. Octavia [4] and Keystone
> [5], which have already implemented it.
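
For illustration, the rule above (health depends on the worker's own DB and
MQ connectivity, not on agent status) could be sketched roughly as below.
This is only a sketch under assumptions: `healthcheck`, `ping_db` and
`ping_mq_down` are hypothetical names, the 200/503 status codes are my
assumption, and it does not reflect how Octavia or Keystone implement it.

```python
from typing import Callable, Dict


def healthcheck(checks: Dict[str, Callable[[], None]]) -> dict:
    """Run each dependency check; a check signals failure by raising."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # any error means the dependency is unusable
            results[name] = "failed: %s" % exc
    healthy = all(value == "ok" for value in results.values())
    # Agent status is deliberately not consulted here.
    return {"status": 200 if healthy else 503, "checks": results}


def ping_db() -> None:
    """Hypothetical placeholder; a real check might run SELECT 1."""


def ping_mq_down() -> None:
    """Hypothetical placeholder simulating an unreachable message broker."""
    raise ConnectionError("broker unreachable")


if __name__ == "__main__":
    print(healthcheck({"db": ping_db, "mq": ping_mq_down}))
```

Passing the checks in as callables keeps the endpoint logic independent of
how the DB and MQ connections are actually probed.
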
> 
> ## Squash of the DB migration script
> Rodolfo explained to us the benefits of squashing the DB migration
> scripts from the old versions:
> * Deployment is faster: we don't need to create/delete tables or create+update
> other ones - the win is small, possibly on the order of 5s per job,
> * The DB definition is centralized in one place, not in the original
> definition plus further migrations - that is the most important reason why
> we really should do it,
> * UTs are faster: some older checks are removed.
> 
> The problem is that we need to do this carefully and be really vocal about
> it, because with such changes we may break stadium projects or 3rd party
> projects which are doing DB migrations too.
> To minimize potential breakage, we will announce such changes on the
> OpenStack discuss mailing list.
> Rodolfo volunteered to propose a squash up to the Liberty release in the W
> cycle. Together with this squash we will also document the process, so that
> in the next cycles we can do squashes for newer releases more easily.
> Lajos volunteered to help with fixing Neutron stadium projects if that is
> needed.
> 
> ## Switch to the new engine facade
> We discussed how to move on and finally finish the old blueprint [6]. We
> decided that Rodolfo and I will test how the new engine facade works
> without using transaction guards in the code. Hopefully that will let us
> move on with this. If not, we will try to reach out to some DB experts for
> help.
> 
> ## Change from rootwrap to the privsep
> This is now a community goal for the Wallaby cycle, so we need to focus on it
> to finally complete that transition.
> This transition may speed up our code and make it a bit more secure.
> Rodolfo explained to us multiple possible migration strategies:
> * move to native, e.g.
>    * replace ps with python psutil, not using rootwrap or privsep
>    * replace ip commands with pyroute2, under a privsep context (elevated
> permissions needed)
> * directly translate rootwrap to privsep, executing the same shell command but
> under a privsep context
> 
> To move on with this I will create a list of the pieces of code which need
> to be transitioned in the Neutron repo and in the stadium projects.
> Current todo items can be found on the storyboard [7].
> 
> ## Migration to nftables
> During this session we discussed potential strategies for migrating
> from the old iptables to the new nftables. We need to start planning that work
> as major Linux distributions (e.g. RHEL) are planning to deprecate iptables
> in their next releases.
> It seems that currently all major distros (Ubuntu, CentOS, openSUSE) support
> nftables already.
> We decided that in the Wallaby cycle we will propose a new _Manager_ class
> and add a config option which will allow people to test the new solution.
> In the next cycles we will continue working on it to make it stable and to
> make the upgrade and migration path for users as easy as possible.
> A blueprint has already been created to track progress on that topic [8].
> We need to migrate:
> * Linuxbridge firewall, iptables OVS hybrid firewall,
> * L3 code (legacy router, DVR router, conntrack, port forwarding),
> * iptables metering,
> * metadata proxy,
> * DHCP agent, for when it serves metadata for isolated networks, and
> namespace creation,
> * neutron-vpnaas - ipsec code,
> * and maybe something else that we haven't found yet.
> 
> ## Nova-Neutron cross project session
> We had a very interesting discussion with the Nova team. We discussed topics
> like:
> * NUMA affinity in the neutron port
> * vhost-vdpa support
> * default __vnic_type__/__port flavour__
> 
> Notes from that discussion are available in Nova's etherpad [9].
> 
> ## Neutron scaling issues
> At this session we discussed issues mentioned by operators at the Forum
> sessions a week before the PTG. A couple of issues were mentioned there:
> * problems with retries of the DB operations - we should migrate all our code
> to the oslo.db retry mechanism - a new blueprint [10] has been created to
> track progress on that one,
> * problems with maintenance of the agents, e.g. the DHCP or L3 agents - many
> of those issues are caused by how our agents are designed, and to really fix
> that we would need very deep and large changes. But many of those issues
> can also be solved by the __ovn__ backend - **and that is the strategic
> direction in which neutron wants to go in the next cycles**,
> * Miguel Lavalle and I volunteered to do some profiling of the agents to see
> where we are losing most of the time - maybe we will be able to find some
> _low hanging fruits_ which can be fixed to improve the situation at least a
> bit,
> * A similar problem exists with neutron-ovs-agent, especially with security
> groups which use _remote group id_ as a reference - here we also need some
> volunteers who will try to optimize that.
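
To make the DB retry item above a bit more concrete, here is a minimal
sketch of a retry decorator with exponential backoff. It is a plain-Python
illustration only and not the oslo.db API: `retry_db`, `RetriableDBError`
and all parameter names are hypothetical.

```python
import functools
import time


class RetriableDBError(Exception):
    """Hypothetical stand-in for a transient DB error (e.g. a deadlock)."""


def retry_db(max_retries=3, interval=0.1, sleep=time.sleep):
    """Retry the wrapped call on RetriableDBError, doubling the delay."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = interval
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except RetriableDBError:
                    if attempt == max_retries:
                        raise  # retries exhausted, surface the error
                    sleep(delay)
                    delay *= 2
        return wrapper
    return decorator
```

Having one shared decorator like this, instead of ad hoc retry loops, is
what makes it possible to tune or fix the retry behaviour in a single place.
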
> 
> ## CI (in)stability
> On Thursday we discussed how to improve our very poor CI. In the end we
> decided to:
> * not recheck patches without giving the reason for the recheck in the
> comment - there should be an already reported bug linked in the _recheck_
> comment, or the user should open a new one and link to it. If the
> problem was e.g. related to infra, a simple comment like _infra issue_ will
> also be enough,
> * To lower the number of existing jobs we will make some changes like:
>    * move *-neutron-lib-master and *-ovs-master jobs to the experimental and
> periodic queues to not run them on every patch,
>    * I will switch the _neutron-tempest-plugin-api_ job to be deployed with
> uwsgi so we can drop the _neutron-tempest-with-uwsgi_ job,
>    * Consolidate the _neutron-tempest-plugin-scenario-linuxbridge_ and
> _neutron-tempest-linuxbridge_ jobs,
>    * Consolidate the _neutron-tempest-plugin-scenario-iptables_hybrid_ and
> _neutron-tempest-iptables_hybrid_ jobs.
> 
> Later we also discussed how to run or skip tests which can only be run
> when some specific feature is available in the cloud (e.g. _Metadata
> over IPv6_). After some discussion we decided to add a new config option with
> a list of enabled features. It will be very similar to the existing
> _api_extensions_ option. Lajos volunteered to work on that.
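
A rough sketch of how such a feature list might gate tests, purely as an
illustration: `ENABLED_FEATURES` stands in for the value of the new (not yet
existing) config option, and `requires_feature` is a hypothetical helper,
not something in neutron-tempest-plugin today.

```python
import functools
import unittest

# Hypothetical value of the proposed "enabled features" config option.
ENABLED_FEATURES = ["metadata-over-ipv6"]


def requires_feature(feature, enabled=None):
    """Skip the decorated test unless `feature` is in the enabled list."""
    def decorator(test_func):
        @functools.wraps(test_func)
        def wrapper(*args, **kwargs):
            # Resolve the list at call time so late config loading works.
            features = ENABLED_FEATURES if enabled is None else enabled
            if feature not in features:
                raise unittest.SkipTest(
                    "feature %s is not enabled in this cloud" % feature)
            return test_func(*args, **kwargs)
        return wrapper
    return decorator


class MetadataTest(unittest.TestCase):

    @requires_feature("metadata-over-ipv6")
    def test_metadata_over_ipv6(self):
        self.assertTrue(True)  # placeholder for the real scenario
```
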
> 
> As the last CI related topic we discussed testing DVR in our CI. Oleg
> Bondarev volunteered to check and try to fix the broken
> _neutron-tempest-plugin-dvr-multinode-scenario_ job.
> 
> ## Flow based DHCP
> 
> Liu Yulong raised the topic of a new way of providing a fully distributed
> DHCP service, instead of using the _DHCP agent_ on the nodes - an RFE is
> proposed at [11]. His proposal for OpenFlow based DHCP (similar to what e.g.
> ovn-controller is doing) is described in [12]. It could be implemented as an
> L2 agent extension and enabled by operators in the config when they need it.
> As a next step Liu will propose a spec with details about this solution and
> we will continue the discussion in the spec's review.

When retiring the DHCP agent was discussed in Shanghai it was assumed 
that the flow-based DHCP server would not be compatible with Ironic. 
Currently the OVN native implementation is not compatible and DHCP agent 
is required, but OVN is planning on implementing support for native DHCP 
for Ironic soon (IIUC).

Was there any discussion about what it might take to extend the 
flow-based DHCP server to support direct connection to VLAN/flat 
networks and the DHCP options required for PXE/iPXE for Ironic? Is that 
a possibility in the future, or would we need to continue to maintain 
the DHCP agent even if OVN no longer requires it?

> 
> ## Routed provider networks limited to one host
> 
> As the last topic on Thursday we briefly talked about the old RFE [13]. Miguel
> Lavalle told us that his company, Verizon Media, is interested in working on
> this RFE in the next cycles. This also involves some work on Nova's side,
> which was already started by Sylvain Bauza. Miguel will sync with Sylvain on
> that RFE.
> 
> ## L3 feature improvements
> 
> On Friday we discussed some potential improvements in the L3 area. Lajos
> and Bence showed us some features which their company is interested in and on
> which they plan to work. Those are things like:
> * support for Bidirectional Forwarding Detection
> * some additional API to set additional router parameters like:
>    * ECMP max path,
>    * ECMP hash algorithm
> * --provider-allocation-pool parameter in the subnets - in some specific cases
> it may help to use IPs from such a _special_ pool for some infrastructure
> needs; more details about that will come in a future RFE.
>
> For now all the improvements described above are in a very early planning
> phase, but Bence will sync with Liu and Liu will dedicate some time to discuss
> progress on them during the __L3 subteam meetings__.

I submitted a spec for installing FRRouting (FRR) via TripleO:

https://review.opendev.org/#/c/758249/

This could be used for ECMP, as well as for routing traffic to the 
HAProxy load balancers fronting the control plane, and advertising 
routes to Neutron IPs on dynamically routed networks (VM IPs and/or 
floating IPs).

The goal is to have a very simple implementation where IP addresses 
would be added to a default or alternate namespace (depending on the use 
case) as loopback addresses with a /32 (v4) or /128 (v6) CIDR. In the 
case of FRR the Zebra daemon receives updates via Netlink when these IP 
addresses are created locally and redistributes them to BGP peers. In 
theory this may allow a different BGP daemon such as Bird or perhaps 
ExaBGP to be easily swapped for FRR.
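
As a small illustration of the /32 (v4) or /128 (v6) convention above, the 
sketch below derives the host CIDR for an address using only the Python 
standard library. The function name `host_cidr` is hypothetical and not 
part of the spec.

```python
import ipaddress


def host_cidr(address: str) -> str:
    """Return the address with a full-length prefix (/32 or /128)."""
    # ip_address() rejects input that already carries a prefix length.
    ip = ipaddress.ip_address(address)
    return "%s/%d" % (ip, ip.max_prefixlen)


if __name__ == "__main__":
    for addr in ("192.0.2.10", "2001:db8::10"):
        print(host_cidr(addr))
```

The resulting CIDR is what would be added as a loopback address in the 
relevant namespace.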

I will look forward to seeing more on the --provider-allocation-pool 
parameter.

> 
> ## Leveraging routing-on-the-host in Neutron in our next-gen clusters
> 
> As the last topic on Friday we discussed potential solutions for _L3 on
> the host_ in Neutron. The idea here is very similar to what e.g. the __Calico
> plugin__ is doing currently.
> More details about potential solutions are described in the etherpad [14].
> During the discussion Dawid Deja from OVH told us that OVH is also using a
> very similar, downstream-only solution.
> The conclusion of that discussion was that we may already have most of the
> needed code in Neutron and some stadium projects, so as a first step the
> people interested in that topic, like Jan Gutter, Miguel and Dawid, will work
> on a deployment guide for such a use case.

Is there any public info on the OVH approach available?

> 
> ## Team photo
> During the PTG we also took team photos, which you can find at [15].
> 
> [1] https://etherpad.opendev.org/p/neutron-wallaby-ptg
> [2] https://blueprints.launchpad.net/neutron/+spec/metadata-over-ipv6
> [3] https://ibb.co/12sB9N9
> [4] https://opendev.org/openstack/octavia/src/branch/master/octavia/api/healthcheck/healthcheck_plugins.py
> [5] https://docs.openstack.org/keystone/victoria/admin/health-check-middleware.html
> [6] https://blueprints.launchpad.net/neutron/+spec/enginefacade-switch
> [7] https://storyboard.openstack.org/#!/story/2007686
> [8] https://blueprints.launchpad.net/neutron/+spec/nftables-migration
> [9] https://etherpad.opendev.org/p/nova-wallaby-ptg
> [10] https://blueprints.launchpad.net/neutron/+spec/oslo-db-retries
> [11] https://bugs.launchpad.net/neutron/+bug/1900934
> [12] https://github.com/gotostack/shanghai_ptg/blob/master/shanghai_neutron_ptg_topics_liuyulong.pdf
> [13] https://bugs.launchpad.net/neutron/+bug/1764738
> [14] https://etherpad.opendev.org/p/neutron-routing-on-the-host
> [15] http://kaplonski.pl/files/Neutron_virtual_PTG_October_2020.tar.gz
> 

-- 
Dan Sneddon         |  Senior Principal Software Engineer
dsneddon at redhat.com |  redhat.com/cloud
dsneddon:irc        |  @dxs:twitter



