[openstack-dev] [Neutron] Stein PTG Summary

Miguel Lavalle miguel at mlavalle.com
Mon Sep 24 21:38:59 UTC 2018

Dear Neutron team,

Thank you very much for your hard work during the PTG in Denver. Thanks to
your efforts, we had a very productive week and we planned and prioritized
a lot of work to be done during the Stein cycle. Following below is a high
level summary of the discussions we had. If there is something I left out,
please reply to this email thread to add it. However, if you want to
continue the discussion on any of the individual points summarized below,
please start a new thread, so we don't have a lot of conversations going on
attached to this update. You can find the etherpad we used during the PTG
meetings here: https://etherpad.openstack.org/p/neutron-stein-ptg.


* The following blueprints were not finished during Rocky and have been
rolled over to Stein:

   - Policy in code is now targeted for Stein-1. Akihiro Amotoki is in
charge of this implementation.
   - Strict minimum bandwidth support (scheduling aware) is targeted for
Stein-2. The assignees on the Neutron side are Bence Romsics and Lajos
Katona. Balazs Gibizer is the assignee on the Nova side. There are no
blockers for this feature to be implemented on either side.
   - Enable adoption of an existing subnet into a subnetpool is targeted
for Stein-2. Bernard Cafarelli has taken over the implementation of this
blueprint: https://blueprints.launchpad.net/neutron/+spec/subnet-onboard
   - Decoupling database imports/access for neutron-lib. The limiting
factor in this effort is review velocity, both in Neutron and in the
related out of tree projects. Attention will be paid to reviewing patches
promptly, to try to finish this effort in Stein.

* One area of concern at the end of Rocky was how to reduce the likelihood
of patches merged in Neutron breaking Stadium and networking projects

   - The approach to be adopted is to add one non-voting job per Stadium
project to the Neutron check queue. Miguel Lavalle will send a message to
the ML asking for job proposals from the Stadium projects
   - Non Stadium networking projects are also invited to add 3rd party CI
jobs, similar to what is done in projects such as Cinder. An example patch
is here: https://review.openstack.org/#/c/604382/.  Boden Russell indicated
he will follow this approach in the case of openstack/vmware-nsx.
   - Another alternative that was considered was to release Neutron once a
month, so Stadium and networking projects can quickly get back to a close
stable point. The consensus of the team, though, gravitated more towards
non-voting Stadium and 3rd party CI jobs, as described in the previous two
bullets
   - Miguel Lavalle will update the code reviews section of the
documentation with guidelines on how to use the
http://codesearch.openstack.org/ online tool to spot impacts of Neutron
patches in the Stadium and other related projects

SR-IOV VF to VF mirroring

* A blueprint proposes to add to Neutron and TaaS (Tap-as-a-Service,
https://github.com/openstack/tap-as-a-service) the capability to do SR-IOV
VF to VF mirroring.

   - A demo of how this can be implemented is here:
https://etherpad.net/p/taas_sriov_demo_stein_ptg
   - The spec for this effort (https://review.openstack.org/#/c/574477)
proposes to implement TaaS agent and driver to support SR-IOV VF to VF
mirroring. This implies the implementation of a framework within TaaS to
manage several types of agents
   - The spec also proposes that the way to specify vlans to be mirrored
will be a "vlan_mirror_list" field in the binding profile of the port
associated to the TaaS Tap Service. There was feedback from the room that
the vlans may be specified instead in the TaaS Tap Flow. Two alternatives
were suggested. The first one is to add to the Tap Flow a "vlan_filter"
attribute. The second one is to add to the Tap Flow the UUID of a
classifier, which can be from the CCF (Common Classifier Framework) or, if
CCF is not ready, from a classifier developed in TaaS. These alternatives
will be discussed in the spec, to give developers in the broader community
the opportunity to influence the decision
   - To support this blueprint, an effort will be made to make TaaS ready
for the Neutron Stadium of projects. The guidelines and checklist to be
used to assess TaaS readiness for the Stadium are outlined here:
Miguel Lavalle and Munish Mehan will work on a patch to conduct an
assessment similar to this one: https://review.openstack.org/#/c/506012/
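For reference, the alternative API shapes discussed above might look as
follows. This is a hedged sketch: the field names (vlan_mirror_list,
vlan_filter, classifier_id) come from the spec discussion, but the final
schema will be settled in the spec review, so treat every field here as
illustrative.

```python
# Hedged sketch of the three API shapes discussed for specifying the VLANs
# to be mirrored. All UUID placeholders and exact field names are
# illustrative; the spec review will settle the final schema.

# Alternative 1 (original proposal): binding profile of the Tap Service port.
tap_service_port = {
    "id": "PORT_UUID",
    "binding:profile": {"vlan_mirror_list": "100-105,200"},
}

# Alternative 2: a "vlan_filter" attribute directly on the Tap Flow.
tap_flow_with_filter = {
    "name": "flow1",
    "source_port": "SOURCE_PORT_UUID",
    "tap_service_id": "TAP_SERVICE_UUID",
    "vlan_filter": "100-105,200",
}

# Alternative 3: the Tap Flow references a classifier, either from the CCF
# or, if the CCF is not ready, one developed within TaaS.
tap_flow_with_classifier = {
    "name": "flow1",
    "source_port": "SOURCE_PORT_UUID",
    "tap_service_id": "TAP_SERVICE_UUID",
    "classifier_id": "CLASSIFIER_UUID",
}
```
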

Testing multinode datapath with Skydive

* Miguel Ajo and Daniel Alvarez presented a demo of Skydive:
https://github.com/skydive-project/skydive
* Skydive is an open source real-time network topology and protocols
analyzer. It aims to provide a comprehensive way of understanding what is
happening in the network infrastructure. Skydive agents collect topology
information and flows and forward them to a central agent for further
analysis
* The proposal is to transform the whitebox tests we have in Tempest for
DVR or HA, which assume specific internal knowledge of the reference
implementation, so that they instead capture the traffic over each node
with the Skydive API and check that the traffic is being handled where and
how we want (DVR, HA routers, etc.).
* The demo script is here:
* The demo was well received by the team. The next steps are:

   - Miguel Ajo to talk to Federico Ressi to agree on tests to be committed
by the end of Stein
   - If they agree to commit tests by the end of Stein, Miguel Lavalle will
create a blueprint to track the effort

L3 topics

* Swaminathan Vasudevan presented remotely to the team the alternatives to
fix the long standing bug https://bugs.launchpad.net/neutron/+bug/1774459,
Update permanent ARP entries for allowed_address_pair IPs in DVR Routers.

   - This bug refers to allowed_address_pairs IP associated with unbound
ports and DVR routers. The ARP entry for the allowed_address_pair IP does
not change based on the GARP issued by any Keepalived instance, since DVR
does the ARP table update through the control plane, and does not allow any
ARP requests to get out of the node.
   - Swami was seeking advice from people with more L2 / OpenFlow
knowledge on how to address this issue
   - With significant input from Miguel Ajo and Daniel Alvarez, the agreed
upon plan is to intercept GARP packets and forward them to the local
controller for processing. Basically flows will be programmed dynamically
when a GARP is recognized by the controller
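The classification step of the agreed plan can be sketched in a few lines.
This is a minimal illustration, not the actual agent code: the flow
programming and packet-in plumbing depend on the agent's OpenFlow
integration, so only the GARP-recognition helper is shown.

```python
# Hedged sketch of the GARP-recognition step in the agreed plan: ARP traffic
# is punted to the local controller, which identifies gratuitous ARPs and
# then programs flows dynamically. The helper below is illustrative only.
ARP_REQUEST = 1
ARP_REPLY = 2

def is_gratuitous_arp(arp_op, sender_ip, target_ip):
    """A GARP is an ARP request or reply whose sender and target IP match."""
    return arp_op in (ARP_REQUEST, ARP_REPLY) and sender_ip == target_ip

# A packet-in handler would call this before updating the DVR ARP entry:
assert is_gratuitous_arp(ARP_REQUEST, "10.0.0.5", "10.0.0.5")
assert not is_gratuitous_arp(ARP_REQUEST, "10.0.0.5", "10.0.0.6")
```
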

* The next L3 topic discussed was the automatic re-balancing of DHCP agents
and L3 routers. Frequently, after starting and stopping nodes during normal
operations, some network nodes might end up overloaded hosting DHCP servers
and L3 routers. The team agreed on the following approaches:

   - No automatic re-balancing, since it may lead to long transitions in
the system
   - For HA routers, Keepalived priorities will be used to maintain the
balance
   - For DVR centralized routers, legacy routers and DHCP agent, scripts to
be executed by the Cloud admin will be the way to go. Operators can
customize these scripts to better fit their needs
   - Neutron documentation will also be improved in this area
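The admin scripts mentioned above could be built around a simple balancing
computation like the following sketch. It is illustrative only: it just
plans the moves, and applying them would use python-neutronclient's
remove_router_from_l3_agent / add_router_to_l3_agent calls, with error
handling and HA-router specifics left to the operator.

```python
# Hedged sketch of a router re-balancing plan for L3 agents. The function is
# pure: it takes the current hosting map and returns the moves needed so
# that no two agents differ by more than one router.
def plan_rebalance(hosting):
    """hosting: {agent: [router, ...]} -> list of (router, src, dst) moves."""
    loads = {agent: list(routers) for agent, routers in hosting.items()}
    moves = []
    # Repeatedly move a router from the most-loaded to the least-loaded
    # agent until the difference is at most one.
    while True:
        lo = min(loads, key=lambda a: len(loads[a]))
        hi = max(loads, key=lambda a: len(loads[a]))
        if len(loads[hi]) - len(loads[lo]) <= 1:
            return moves
        router = loads[hi].pop()
        loads[lo].append(router)
        moves.append((router, hi, lo))

# Each planned move would then be applied with python-neutronclient:
#   client.remove_router_from_l3_agent(src, router)
#   client.add_router_to_l3_agent(dst, {"router_id": router})
```
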

* Miguel Lavalle presented https://bugs.launchpad.net/neutron/+bug/1789391,
which proposes to implement VPC peering for Neutron routers.

   - Some private cloud offerings (Huawei's among them) give users the
ability to implement AWS VPC functionality using Neutron routers
   - The next logical step may be VPC peering, by enabling Neutron routers
to talk to each other.
   - The proposal was well received and accepted by the team. The next step
is to write a spec to iron out the technical details

neutron-lib topics

* The first topic was how to identify projects that are "current" and hence
want the ongoing neutron-lib consumption patches created by Boden Russell

   - Up until now, we look at the neutron-lib version in requirements.txt
to see which projects are "up to date" and should get consumption patches.
It is becoming difficult to search for and easily find these current
consumers.
   - It was agreed that from now on, an active "opt in" mechanism will be
implemented. It will consist of a comment in a project file or a project
tag. Boden will send a message to the ML to outline the process

* For those projects that opt in to neutron-lib consumption patches

   - Boden is willing to help them set up for Zuul V3
   - https://etherpad.openstack.org/p/neutron-sibling-setup contains a list
of "current" networking projects under "List of networking related projects
that are current and what they are missing". All projects need local tox
targets. Only two appear to require zuul updates
   - If projects have to pull Tempest tests to their own Tempest plugin,
they will have to do it themselves. However, it appears all have moved
their tempest code out of tree

* The team discussed the possibility to test neutron changes with
neutron-lib from master branch instead of the latest released version

   - Today we have the "dummy patch" approach. It seems this could be done
by having the Neutron change tested with a dummy neutron-lib patch that
depends on it; this would run the "neutron-src" zuul jobs on the dummy lib
patch, using the respective neutron master patch
   - After further testing, it was discovered that the "dummy patch"
approach is not working as expected. As a consequence, Boden has proposed
the following patch to start testing Neutron patches with neutron-lib
master: https://review.openstack.org/#/c/602748/

* The team agreed that a spec is needed on how to handle API definitions
for extensions that extend a dynamic set of resources. Volunteers will be
requested for this (see next bullet)

* Nate Johnston indicated that developers that want to help are not sure
where to start, and suggested the creation of a punch list of the remaining
items to be migrated

   - Miguel Lavalle will send a message to the ML asking for volunteers for
the neutron-lib effort
   - If we get volunteers, Boden will update the punch list. A starting
point could be to drive the "dynamic API extensions" topic mentioned in the
previous bullet

SmartNIC support

* Lianhao Lu and Isaku Yamahata submitted to the consideration of the team
two specs to support smart NICs in Neutron:
https://review.openstack.org/#/c/595402/

* In the context of this topic, a SmartNIC is a card that runs OVS, which
enables its remote control over the OVSDB protocol

* The overall goal is to significantly increase the number of Ironic
compute hosts that can be managed in a deployment.

   - The general idea is to create a “super OVS agent”, running in the
OpenStack controller, that will have multiple threads configuring SmartNICs
in the compute hosts using OVSDB, eliminating the need to have an agent in
each host and eliminating, as a consequence, the use of communications over
the RPC channel, which has been identified as a bottleneck in the number of
compute hosts that a deployment can handle.
   - This approach can be extended to VM-based deployments with no
SmartNICs, by configuring OVS in the compute hosts to be managed remotely
over OVSDB by the “super OVS agent”. This would make it possible to
increase the maximum number of compute hosts that Neutron can manage in a
single deployment
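As a rough illustration of the remote-management idea, the “super OVS
agent” mostly needs an OVSDB connection target per managed host. The sketch
below assumes ovsdb-server on each SmartNIC or compute host is configured
to listen on TCP port 6640 (the conventional OVSDB port); the real agent
would use a proper OVSDB client library and a pool of worker threads.

```python
# Hedged sketch: one OVSDB connection string per managed SmartNIC/host,
# replacing the per-host agent and its RPC traffic. The addresses and the
# port are placeholders.
def ovsdb_targets(hosts, port=6640):
    """Build one OVSDB connection string per managed host."""
    return {host: "tcp:%s:%d" % (host, port) for host in hosts}

# The same connection strings work from the CLI for debugging, e.g.:
#   ovs-vsctl --db=tcp:192.0.2.10:6640 show
targets = ovsdb_targets(["192.0.2.10", "192.0.2.11"])
```
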

* The proposal was well received by the team and it was agreed that the
next step is to add more detail to the specs under review, with the aim to
clarify the technical details of the implementation

   - One point that requires special clarification is how the proposed
changes will impact the L2pop mechanism driver. This clarification might
take place in a separate spec

Ironic x-project discussion - Smartnics

* In this session, the Ironic and Neutron teams got together to explore
further using SmartNICs to increase the number of compute hosts that can be
supported in one deployment (see above "SmartNIC support" topic)

   - The session was greatly facilitated by the fact that the Neutron team
had already agreed the evening before to go ahead with SmartNICs support
   - There was some discussion as to how Neutron is going to discover the
credentials that will be used by the “super OVS agent” to manage OVS in the
SmartNICs. The alternatives considered were to include these credentials in
the port binding profile or the use of REST calls. The final decision
between these alternatives will be made in the related specs: on the Ironic
side https://review.openstack.org/#/c/582767/ and on the Neutron side
https://review.openstack.org/#/c/595402/
   - Julia Kreger will propose a joint Forum session for the Berlin Summit
to review progress. The goal is to have the specs finished when that
session takes place

StarlingX feature upstream

* Almost all the morning on Thursday 13th was devoted to the review and
discussion of StarlingX specs

* In preparation for this session, several Neutron team members conducted
on Monday 10th a "review sprint" of the specs submitted by the StarlingX
team, providing feedback in Gerrit

* We started this session with Matt Peters of Wind River giving the
Neutron team an overview of the goals and technical architecture of the
StarlingX project. This presentation, along with the feedback that the
Neutron team provided beforehand in the specs, really went a long way
toward eliminating misunderstandings on both sides, paving the way to good
agreements for both sides on all the specs:

   - Provider Network Management https://review.openstack.org/599980. Much
of the problem with this spec was a nomenclature misunderstanding. We all
agreed to move ahead with this spec, adjusting the use of the “provider
network” term and creating one new resource in the Neutron API to manage
all the configuration options that will be managed by API calls, which will
override values set in files. The Oslo team will be consulted on this
   - System Host Management https://review.openstack.org/599981. After
clarifying that the real requirement in this spec is the capability to set
an agent administratively down, the team agreed to continue the development
of this spec using the existing Boolean attribute admin_state_up in DHCP
and L3 agents
   - Fault Management https://review.openstack.org/599982. The team agreed
that this is not in the scope of Neutron. The StarlingX team agreed to drop
this spec
   - Host Bindings of Provider Networks https://review.openstack.org/579410.
The underlying need in this spec is the ability to handle more than one L2
agent running in a compute host. The StarlingX team agreed to address this
need using existing Neutron facilities and features
   - Segment Range Management of Self-service Networks:
https://review.openstack.org/579411. This spec will be dropped by the
StarlingX team and folded under Provider Network Management
https://review.openstack.org/599980 above
   - Rescheduling of DHCP Servers and Routers:
https://review.openstack.org/595978. Given that Neutron already has an API
call to re-schedule routers, and in light of the discussion on Wednesday
about agents re-balancing, the StarlingX team will reevaluate their
proposed approach in this spec
   - ML2 connection auditing and monitoring
https://review.openstack.org/#/c/589313. There is enough data in Neutron
log files to enable external management systems (Nagios for example) to
address this need, so this spec will be dropped. The StarlingX team might
follow up with a request to improve log messages
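The System Host Management agreement above relies on the existing agent
resource's admin_state_up attribute. A minimal sketch of the update body,
as it would be sent to the Neutron agents API (the python-neutronclient
call shown in the comment assumes configured credentials and a real agent
UUID):

```python
# Hedged sketch: administratively disabling a DHCP or L3 agent uses the
# existing admin_state_up Boolean on the agent resource; no new API is
# needed. The body shape matches the Neutron agents API.
def agent_down_body():
    """Request body to set an agent administratively down."""
    return {"agent": {"admin_state_up": False}}

# With python-neutronclient (agent_id is a placeholder):
#   client.update_agent(agent_id, agent_down_body())
```
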

* Miguel Lavalle invited the StarlingX team to add topics to the weekly
team meeting in the "On demand agenda" section, whenever further
conversation is deemed necessary. Miguel will also participate once a month
in the StarlingX networking meeting

Minimum bandwidth scheduling demo

* After lunch, there was a live demo of the bandwidth based scheduling
feature, conducted by Bence Romsics and Balazs Gibizer.

   - The demo was successful and it showed that the Nova and Neutron teams
are well on their way to finishing the implementation of this feature
during the Stein cycle

Nova x-project discussion

* The session started with a discussion of the recently merged (Rocky
cycle) multiple port binding feature

   - Sean Mooney of Red Hat has been testing this feature by moving VMs
across hosts with heterogeneous Neutron back-ends (OVS, Linuxbridge, etc.)
   - The current code provides 90% of the functionality needed but some
gaps remain
   - Live migration between different firewall drivers works
   - Live migration between OVS and Linux bridge back-ends almost works.
These bugs have been filed:  https://bugs.launchpad.net/neutron/+bug/1788012
and https://bugs.launchpad.net/neutron/+bug/1788009. Sean will open a bug
for the fact that Nova libvirt driver does not use the bridge name from
destination binding
   - Live migration fails between kernel vhost and vhost-user. This is
because Nova sets the MTU for kernel vhost tap devices in libvirt xml but
doesn't set the MTU in the XML for vhost user. Sean will file a bug and fix
it in Nova
   - The agreement was that Sean will fix these bugs during the Stein cycle

* Work on bandwidth based scheduling is progressing at a good pace and is
expected to be finished by Stein-2. Currently, there are no blockers,
either on the Nova or the Neutron side

   - There are specs for placement currently under review that originated
in the work being done on bandwidth based scheduling: "any traits" support
to allow modeling bandwidth requests for multi segment networks
(https://review.openstack.org/#/c/565730 and
https://review.openstack.org/#/c/565741/), a sub-tree filter for GET
/resource_providers to allow easier inventory handling for Neutron
(https://review.openstack.org/#/c/595236/), and resource provider - request
group mapping in allocation candidates, to have a scalable way to figure
out which RP provides resources for which Neutron port during server create
* Making Neutron the only network back-end for Nova, e.g. deleting Nova
Networks

   - There are still users (CERN) who need Nova Networks
   - Some progress can be made in Stein on cleaning up Nova unit/functional
tests to stop using the nova-network specific stubs and move those over to
using NeutronFixture. This will receive very low review attention

* Port bindings v3, e.g. extending Neutron to return os-vif objects when
binding a port

   - This is just to finish the original plan for os-vif
   - The agreement was that this should be done but with a very low
priority in this cycle

Cyborg x-project discussion

* Sundar Nadathur presented a proposal to enable the joint management of
NICs with FPGA capabilities

* An explanation of how ML2 mechanism drivers work was given to the Cyborg
team
* The agreements were:

   - Sundar to submit a spec to develop an ML2 mechanism driver to handle
the binding of Neutron ports with this type of card
   - Create a project under Neutron governance, possibly named
networking-fpga, to be the repository for the mechanism driver mentioned in
the previous point

Python-3 goal tests

* Nate Johnston is leading the Neutron effort to comply with this community
goal: https://governance.openstack.org/tc/goals/stein/python3-first.html
* The decisions made during this session were:

   - We will run unit and functional tests in Python 2.7 during Stein. In
the T cycle we will get rid of the functional job. In the U cycle, we will
let go of Python 2.7 completely
   - We will switch all jobs to py36
   - Make openstack-tox-py36 a voting and gating job
   - A message will be sent to the mailing list offering Nate's assistance
to the Neutron Stadium projects with the Python 3 transition

Neutron upgrades/OVO

* OVO adoption continues making steady progress

   - More contributors are welcome. The backlog is kept here:
   - The best document for new contributors looking to become familiar with
OVO is here:

* The OVO sub-team has also agreed to continue the adoption of the new
engine facade
(https://blueprints.launchpad.net/neutron/+spec/enginefacade-switch), since
both efforts work in the DB layer and entail traversing the code looking
for opportunities for adoption

   - Neutron OVO objects are implemented with support for the new engine
facade in the base class but it is currently disabled globally. It will be
enabled on a per object basis
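The per-object enablement pattern described above can be illustrated with a
self-contained sketch. The names loosely mirror Neutron's convention (a
new_facade class attribute on the OVO base class), but check
neutron/objects/base.py for the real attribute and semantics; the
subclasses here are purely illustrative.

```python
# Hedged, self-contained sketch: the OVO base class carries a flag for the
# new engine facade, disabled globally, and individual objects flip it once
# their DB code paths are verified. Class names below are illustrative.
class NeutronDbObject(object):
    new_facade = False          # disabled globally in the base class

    @classmethod
    def uses_new_facade(cls):
        return cls.new_facade

class PortForwarding(NeutronDbObject):
    new_facade = True           # enabled per object once validated

class Router(NeutronDbObject):
    pass                        # still on the old facade
```
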

SNAT logging extension

* Yushiro Furukawa proposed to extend the work that has been done so far in
logging (security groups and FWaaS) to SNAT
* The team agreed that SNAT is a sensible next step for logging
* The work plan is the following:

   - Migrate libnetfilter_log from neutron-fwaas into neutron-lib (in
order to call this driver from SNAT logging):
https://review.openstack.org/#/c/603310. The target for this is Stein-1
   - RPC, agent extension, doc and CLI implementation in Neutron targeted
for Stein-2
   - Testing is targeted for Stein-3

Performance and scalability

* Nate Johnston will implement "Speed up Neutron port bulk creation"

   - This is in support of certain kuryr use cases

* Slowness in the Neutron API. Slawek Kaplonski shared with the team the
response times of Neutron REST API requests across several check queue
jobs. For each job, results are shown for a timed out and a successful run:
neutron-tempest-dvr - http://paste.openstack.org/show/729482/,
neutron-tempest-iptables_hybrid - http://paste.openstack.org/show/729485/,
neutron-tempest-linuxbridge - http://paste.openstack.org/show/729487/,
neutron-tempest-plugin-dvr-multinode-scenario -
http://paste.openstack.org/show/729488/ and tempest-full-py3 -
http://paste.openstack.org/show/729489/. The overall conclusion is that a
small number of requests seem to be slower than expected and show up in the
analysis of all the jobs. As a result, the following decisions were made:

   - Create a performance sub-team that meets regularly (once a month in
principle) to review and identify problematic API requests. Miguel Lavalle
will schedule these meetings
   - The performance sub-team will define thresholds for acceptable
response times, will encode them in Rally jobs and fail the jobs if
thresholds are not met
   - The performance sub-team will improve the tooling to measure response
times
   - Slawek Kaplonski and Miguel Lavalle will work on a first version of
measurements
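The idea of encoding thresholds in Rally jobs could use Rally's SLA
section; a minimal sketch (the scenario name is just an example, and the
thresholds are placeholders to be replaced by the sub-team's measured
baselines):

```yaml
NeutronNetworks.create_and_list_ports:
  - runner:
      type: constant
      times: 100
      concurrency: 10
    sla:
      # Fail the job if the average request duration exceeds the threshold
      max_avg_duration: 4.0
      failure_rate:
        max: 0
```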

QoS Topics

* In the QoS session, the following RFEs and bugs were discussed:

   - [RFE] Does not support shared N-S qos per-tenant,
https://bugs.launchpad.net/neutron/+bug/1787793. A technical solution based
on TC has already been proposed in the RFE itself. The team supported the
implementation of the proposed functionality and the next step is to draft
a spec that will clarify how the feature will be handled from the API point
of view
   - [RFE] Does not support ipv6 N-S qos,
https://bugs.launchpad.net/neutron/+bug/1787792. The team also supported
the implementation of this feature, with the next steps being: 1)
investigate how we can use the Common Classifier Framework to associate QoS
rules with classifiers and apply them only to a specific class of traffic,
which would make it possible to use them for bandwidth limits at the FIP
and L2 port levels, as well as for e.g. DSCP marking rules and other
things, 2) make a PoC of such a solution, or find out whether there are
other ways to do something like that, 3) in the case of VPN QoS, also
implement support for classful bandwidth limits in the tc driver, so it may
be reused for this RFE as well
   - Instances miss neutron QoS on their ports after unrescue and soft
reboot, https://bugs.launchpad.net/neutron/+bug/1784006. Miguel Lavalle
will investigate this bug, although the submitter indicated that it doesn't
affect Nova anymore

CI stability - some jobs are not stable and require multiple rechecks

* Slawek Kaplonski shared with the team on Friday morning a list of tests
and jobs that are unstable. The team decided to distribute those tests and
jobs to work on fixes. This is the list and the comments that came back
from the team:

   - https://bugs.launchpad.net/neutron/+bug/1717302. The problem was
traced back to a wrong version of Keepalived. It is fixed in the master and
Rocky branches. It may need a fix in the Queens branch
   - https://bugs.launchpad.net/neutron/+bug/1789434. Taken over by Manjeet
Singh Bhatia
   - https://bugs.launchpad.net/neutron/+bug/1766701. Fixed with
   - https://bugs.launchpad.net/neutron/+bug/1779075. This bug was taken
over by Miguel Lavalle and will be addressed as part of the new performance
sub-team work mentioned above
   - https://bugs.launchpad.net/neutron/+bug/1726462. This bug doesn't seem
to be caused by Neutron. Miguel Lavalle will ping the Cinder team about it
   - https://bugs.launchpad.net/neutron/+bug/1779077. We haven't hit this
bug lately, according to logstash query in the bug. Miguel Lavalle will
watch it over the next few days
   - https://bugs.launchpad.net/neutron/+bug/1779328. Needs an owner
   - https://bugs.launchpad.net/neutron/+bug/1687027 and
https://bugs.launchpad.net/neutron/+bug/1784836. These bugs are related to
DB migrations and are being investigated by Miguel Lavalle
   - https://bugs.launchpad.net/neutron/+bug/1791989. Taken over by Slawek
   - https://bugs.launchpad.net/neutron/+bug/1779801. After analysis was
done, it was marked as invalid


FWaaS topics

* Announce removal of FWaaS V1 in Stein

   - German Eichberger will send a message to the mailing list making the
announcement

* Miguel Lavalle and Hongbin Lu presented four specs that Huawei proposes
to expand the FWaaS API. The specs were well received by the team, and the
next step is to provide detailed feedback in Gerrit. These specs, along
with the overall feedback provided during the session, are:

   - Add support for dynamic rules: https://review.openstack.org/#/c/597724/.
Will review more and get feedback from cores
   - Extend firewall group inclusion:
https://review.openstack.org/#/c/600261/. Here 'firewall_groups' is
analogous to remote_group_id for SGs. Will comment about remote_fwg_id.
   - Introduce action 'redirect': https://review.openstack.org/#/c/600563/.
Consider how to specify the target (e.g. a 3rd-party DPI) and some
constraints in the case of 'redirect'
   - Add support for priority: https://review.openstack.org/#/c/600870/
Sync up the use-case for using multiple firewall groups on a port

* Other specs that were considered during the meeting by the team:

   - Firewall audit notification: https://review.openstack.org/461657. Need
to reach out to Zhaobo to sync up
   - Firewall Rule Scheduling: https://review.openstack.org/236840. Need to
reach out to Zhaobo to sync up

* German Eichberger suggested evaluating the possibility of deprecating
security groups in Neutron and replacing them with FWaaS V2.0

   - The consensus was that that decision cannot be made without wide
agreement from the community on whether this is desirable / feasible or not
   - German Eichberger will send a message to the ML to gather feedback
from the community

* Other topics discussed during the meeting:

   - Rework Firewall Group status: Sridar Kandaswami will start work on this
   - Tempest tests, including scenario: Will be implemented in this cycle
   - Address group support: implementation work will continue in Stein
   - Zuul v3 testing: Nate Johnston is working on this
   - Scoping of L4-L7 support (WIP spec:
https://review.openstack.org/#/c/600714/): any protocol for L7 filtering
   - Common Classifier Framework: requires more investigation. Will sync
up with the CCF team
   - Horizon Technical Debt: Will sync up with Akihiro Motoki, and Yushiro
Furukawa will learn AngularJS and Django
   - Remote firewall group support: German Eichberger will continue working
on this during Stein