[openstack-dev] [Neutron] Train PTG Summary

Miguel Lavalle miguel at mlavalle.com
Sun May 19 21:15:27 UTC 2019


Dear Neutron team,

Thank you very much for your hard work during the PTG in Denver. Even
though it took place at the end of a very long week, we had a very
productive meeting and we planned and prioritized a lot of work to be done
during the cycle. Below is a high-level summary of the
discussions we had. If there is something I left out, please reply to this
email thread to add it. However, if you want to continue the discussion on
any of the individual points summarized below, please start a new thread,
so we don't have a lot of conversations going on attached to this update.
You can find the etherpad we used during the PTG meetings here:
https://etherpad.openstack.org/p/openstack-networking-train-ptg


Retrospective
==========

* The team considered the following to be positive points during the Stein cycle:

   - Implemented and merged all the targeted blueprints.
   - Minted several new core team members through a mentoring program. The
new core reviewers are Nate Johnston, Hongbin Lu, Liu Yulong, Bernard
Caffarelli (stable branches) and Ryan Tidwell (neutron-dynamic-routing)
   - Very good cross project cooperation with Nova (
https://blueprints.launchpad.net/neutron/+spec/strict-minimum-bandwidth-support)
and StarlingX (
https://blueprints.launchpad.net/neutron/+spec/network-segment-range-management
)
   - The team got caught up with all the community goals
   - Added non-voting jobs from the Stadium projects, enabling the team to
catch potential breakage due to changes in Neutron
   - Successfully forked the Ryu SDN framework, which is used by Neutron
for OpenFlow programming. The original developer is no longer supporting
the framework, so the Neutron team forked it as os-ken (
https://opendev.org/openstack/os-ken) and adopted it seamlessly in the code

* The team considered the following as improvement areas:

   - At the end of the cycle, we experienced failures in the gate that
impacted the speed at which code was merged. Measures to solve this problem
are discussed in the "Neutron CI stability" section below
   - The team didn't make much progress adopting Storyboard. Given comments
from other projects about missing functionality, a decision was made to
evaluate progress by other teams before moving ahead with Storyboard
   - Lost almost all the key contributors in the following Stadium
projects: https://opendev.org/openstack/neutron-fwaas and
https://opendev.org/openstack/neutron-vpnaas. Miguel Lavalle will talk to
the remaining contributors to assess how to move forward
   - Not much concrete progress was achieved by the performance and
scalability sub-team. Please see the "Neutron performance and scaling up"
section below for next steps
   - Engine facade adoption didn't make much progress due to the loss of
all the members of the sub-team working on it. Miguel Lavalle will lead
this effort during Train. Nate Johnston and Rodolfo Alonso volunteered to
help. The approach will be to break up this patch into smaller, more easily
implementable and reviewable chunks: https://review.opendev.org/#/c/545501/


Support more than one segment per network per host
========================================

The basic value proposition of routed networks is to allow deployers to
offer their users a single "big virtual network" without the performance
limitations of large L2 broadcast domains. This value proposition is
currently limited by the fact that Neutron allows only one segment per
network per host:
https://github.com/openstack/neutron/blob/77fa7114f9ff67d43a1150b52001883fafb7f6c8/neutron/objects/subnet.py#L319-L328.
As a consequence, as demand for IP addresses exceeds the limits of a
reasonably sized subnet (a /22 is the consensus upper limit), it becomes
necessary to allow hosts to be connected to more than one segment in a
routed network.
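
For a sense of the scale involved, a quick check with Python's ipaddress
module (a minimal illustration, not from the PoC):

    import ipaddress

    # A /22 yields 1024 addresses (1022 usable hosts), so once a routed
    # network must serve more instances than that, hosts need to be
    # connected to more than one segment.
    net = ipaddress.ip_network('10.0.0.0/22')
    print(net.num_addresses)  # 1024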

David Bingham and Kris Lindgren (GoDaddy) have been working on PoC code to
implement this (https://review.opendev.org/#/c/623115). This code has
helped to uncover some key challenges:

* Change all code that assumes a 1-1 relationship between network and
segment per host into a 1-many relationship.
* Generate IDs based on segment_id rather than network_id to be used in
naming software bridges associated with the network segments (see the
sketch after this list).
* Ensure the new 1-many relationship (network -> segment) can be supported
by ML2 driver implementors.
* Provide migration paths for current deployments of routed networks.
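
A minimal sketch of the naming change implied by the second bullet,
assuming the Linux bridge agent's "brq" + truncated UUID convention (the
function below is illustrative, not the PoC code):

    BRIDGE_PREFIX = 'brq'
    RESOURCE_ID_LENGTH = 11  # characters of the UUID kept in device names

    def bridge_name(network_id, segment_id=None):
        # Today: one bridge per network per host, keyed on the network
        # UUID. Proposed: key on the segment UUID, so a host can carry
        # several bridges for the same routed network.
        key = segment_id if segment_id else network_id
        return BRIDGE_PREFIX + key[:RESOURCE_ID_LENGTH]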

The agreements made were the following:

* We will write a spec reflecting the learnings of the PoC
* The spec will target all the supported ML2 backends, not only some of them
* We will modify and update ML2 interfaces to support the association of
software bridges with segments, striving to provide backwards compatibility
* We will try to provide an automatic migration option that only requires
re-starting the agents. If that proves not to be possible, a set of
migration scripts and detailed instructions will be created

The first draft of the spec is already up for review:
https://review.opendev.org/#/c/657170/


Neutron CI stability
==============

At the end of the Stein cycle the project experienced a significant impact
due to CI instability. This situation has improved recently but there are
still gains to be achieved. The team discussed two major areas of
improvement: making sure we don't run more tests than are necessary
(simplification of jobs) and fixing recurring problems.

- To help the conversation on simplification of jobs, Slawek Kaplonski
shared this matrix showing what currently is being tested:
https://ethercalc.openstack.org/neutron-ci-jobs

   * One approach is reducing the services Neutron is tested with
in integrated-gate jobs (tempest-full), which will reduce the number of
failures not related to Neutron. Slawek Kaplonski represented Neutron in
the QA PTG session where this approach was discussed. The proposal being
socialized in the mailing list (
http://lists.openstack.org/pipermail/openstack-discuss/2019-May/005871.html
) involves:

      # Run only dependent service tests on project gate
      # The Tempest gate will keep running all the service tests as the
integrated gate at a centralized place without any change in the current
job
      # Each project can run a simplified integrated gate job template
tailored to its needs
      # All the simplified integrated gate job templates will be defined
and maintained by the QA team
      # For Neutron there will be an "Integrated-gate-networking". Tests to
run in this template: Neutron APIs, Nova APIs, Keystone APIs. All
scenarios currently running in tempest-full will run in the same way
(i.e., non-slow and in serial). The improvement will be to exclude the
Cinder API tests, Glance API tests and Swift API tests

   * Another idea discussed was removing single node jobs that are very
similar to multinode jobs

      # One possibility is consolidating grenade jobs. There is a proposal
being socialized in the mailing list:
http://lists.openstack.org/pipermail/openstack-discuss/2019-May/006146.html
      # Other consolidation of single node - multinode jobs will require
stabilizing the corresponding multinode job

- One common source of problems is ssh failures in various scenario tests

   * Several team members are working on different aspects of this issue
   * Slawek Kaplonski is investigating authentication failures. As of the
writing of this summary, it has been determined that there is a slowdown in
the metadata service, either on the Neutron or the Nova side. Further
investigation is ongoing
   * Miguel Lavalle is adding tcpdump commands to router namespaces to
investigate data path disruptions (a sketch of this follows below)
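
A minimal sketch of the kind of capture being added, assuming the L3
agent's "qrouter-<router_id>" namespace naming (the helper itself is
illustrative):

    import subprocess

    def capture_router_traffic(router_id, output_file, count=100):
        # Run tcpdump inside the router's namespace to record data path
        # disruptions while a scenario test runs (requires root).
        namespace = 'qrouter-%s' % router_id
        cmd = ['ip', 'netns', 'exec', namespace,
               'tcpdump', '-n', '-c', str(count), '-w', output_file]
        return subprocess.Popen(cmd)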


networking-omnipath
================
networking-omnipath (https://opendev.org/x/networking-omnipath) is an ML2
mechanism driver that integrates the OpenStack Neutron API with an Omnipath
backend. It enables the Omnipath switching fabric in an OpenStack cloud;
each network in the OpenStack networking realm corresponds to a virtual
fabric on the Omnipath side.

   - Manjeet Singh Bhatia proposed to make networking-omnipath a Neutron
Stadium project
   - The agreement was that Miguel Lavalle and Manjeet will work together
in determining whether networking-omnipath meets the Stadium project
criteria, as outlined here:
https://docs.openstack.org/neutron/latest/contributor/stadium/governance.html#when-is-a-project-considered-part-of-the-stadium
   - In case the criteria are not met, a remediation plan will be defined


Cross networking project topics
=======================

- Adoption of WSGI

   * Neutron is the only project not using WSGI
   * We have to make it the default option in DevStack, although this will
require some investigation
   * We already have a non-voting job for WSGI in the check queue. It is
failing constantly, although the failures are all due to a single test case
(neutron_add_remove_fixed_ip). Miguel Lavalle will investigate and fix it
   * Target is to adopt WSGI as the default by Train-2

- Adoption of neutron-classifier (https://opendev.org/x/neutron-classifier)

   * David Shaughnessy has two PoC patches that demonstrate the adoption of
neutron-classifier into Neutron's QoS. David will continue refining these
patches and will bring them up for discussion in the QoS sub-team meeting
on May 14th
   * It was also agreed to start the process of adding neutron-classifier
to the Neutron Stadium. David Shaughnessy and Miguel Lavalle will work on
this per the criteria defined in
https://docs.openstack.org/neutron/latest/contributor/stadium/governance.html#when-is-a-project-considered-part-of-the-stadium

- DHCP agent configured with mismatching domain and host entries

   * Since the merge of https://review.opendev.org/#/c/571546/, there has
been confusion about what exactly the dns_domain field of a network is for.
Historically, dns_domain was meant for use with external DNS integration in
the form of Designate, but that delineation has become muddied with the
previously mentioned change
   * Miguel Lavalle will go back to the original spec of DNS integration
and make a decision as to how to move forward

- Neutron Events for smartNIC-baremetal use-case

   * In the smartNIC baremetal use case, Ironic needs to know when the
agent is or is not alive (since the Neutron agent is running on the
smartNIC) and when a port becomes up or down
   * General agreement was to leverage the existing notifier mechanism to
emit this information for Ironic to consume (requires implementation of an
API endpoint in Ironic). It was also agreed that a spec will be created. A
sketch of the idea follows below
   * The notifications emitted can be leveraged by Ironic for other use
cases. In fact, in a lunch with Ironic team members (Julia Kreger,
Devananda van der Veen and Harald Jensås), it was agreed to use it also for
the port bind/unbind completed notification.
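
A minimal sketch of the notifier idea using oslo.messaging (the event
names, payloads and topic below are assumptions to be settled in the spec):

    from oslo_config import cfg
    import oslo_messaging

    transport = oslo_messaging.get_notification_transport(cfg.CONF)
    notifier = oslo_messaging.Notifier(
        transport, publisher_id='network.agent-host',
        driver='messagingv2', topics=['ironic_events'])

    # Emitted when the agent on the smartNIC changes liveness, or when a
    # port changes state; Ironic would consume these via a new API endpoint.
    notifier.info({}, 'baremetal.agent_heartbeat',
                  {'host': 'smartnic-1', 'alive': True})
    notifier.info({}, 'baremetal.port_status',
                  {'port_id': 'PORT_UUID', 'status': 'ACTIVE'})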


Neutron performance and scaling up
===========================

- Recently, a performance and scalability sub-team (
http://eavesdrop.openstack.org/#Neutron_Performance_sub-team_Meeting) has
been formed to explore ways to improve performance overall
- One of the activities of this sub-team has been adding osprofiler to the
Neutron Rally job (https://review.opendev.org/#/c/615350); a sketch of the
kind of annotation involved appears at the end of this section. Sample
result reports can be seen here:
http://logs.openstack.org/50/615350/38/check/neutron-rally-task/0a4b791/results/report.html.gz#/NeutronNetworks.create_and_delete_ports/output
 and
http://logs.openstack.org/50/615350/38/check/neutron-rally-task/0a4b791/results/report.html.gz#/NeutronNetworks.create_and_delete_subnets/output
- Analysis of the reports indicates that port creation takes on average on
the order of 9 seconds, even without assigning IP addresses to the port and
without binding it. The team decided to concentrate its efforts on
improving the entire port creation - binding - wiring cycle. One step
necessary for this is the addition of a Rally scenario, which Bence Romsics
volunteered to develop.
- Another area of activity has been EnOS (
https://github.com/BeyondTheClouds/enos ), which is a framework that
deploys OpenStack (using Kolla Ansible) and then runs Rally based
performance experiments on that deployment (
https://enos.readthedocs.io/en/stable/benchmarks/index.html)

   * The deployment can take place on VMs (Vagrant provider) or in large
clouds such as the Grid5000 testbed: https://www.grid5000.fr/w/Grid5000:Home
   * The Neutron performance sub-team and the EnOS developers are
cooperating to define a performance experiment at scale
   * To that end, Miguel Lavalle has built a "big PC" with an AMD
Threadripper 2950x processor (16 cores, 32 threads) and 64 GB of memory.
This machine will be used to experiment with deployments in VMs to refine
the scenarios to be tested, with the additional benefit that the Rally
results will not be affected by variability in the OpenStack CI
infrastructure.
   * Once the Neutron and EnOS teams reach an agreement on the scenarios to
be tested, an experiment will be run on Grid5000
   * The EnOS team delivered on May 6th the version that supports the Stein
release

- Miguel Lavalle will create a wiki page to record a performance baseline
and track subsequent progress
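
As mentioned above, osprofiler drives the per-call timing in the linked
reports. A minimal sketch of the kind of annotation involved (the class and
method below are illustrative, not Neutron's actual code):

    from osprofiler import profiler

    profiler.init(hmac_key='SECRET_KEY')  # enables trace collection

    @profiler.trace_cls('plugin')  # times every public method of the class
    class FakePlugin(object):
        def create_port(self, context, port):
            return {'id': 'fake-port'}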


DVR Enhancements
===============

- Supporting allowed_address_pairs for DVR is a longstanding issue for DVR:
https://bugs.launchpad.net/neutron/+bug/1774459. There are two patches up
for review to address this issue:

   * https://review.opendev.org/#/c/616272/
   * https://review.opendev.org/#/c/651905/

- The team discussed the current state of DVR functionality and identified
the following missing features that would be beneficial for operators:

   * Distributed ingress/egress for IPv6. Distributed ingress/egress (AKA
"fast-exit") would be implemented for IPv6. This would bypass the
centralized router in a network node
   * Support for smart-NIC offloads. This involves pushing all DVR
forwarding policy into OVS and implementing it via OpenFlow
   * Distributed DHCP. Rather than having DHCP for a given network be
answered centrally, OpenFlow rules will be programmed into OVS to provide
static, locally generated responses to DHCP requests received on br-int
(see the sketch after this list)
   * Distributed SNAT. This involves allowing SNAT to happen directly on
the compute node instead of centralizing it on a network node.
   * There was agreement that these features are needed and Ryan Tidwell
agreed to develop a spec as the next step. The spec is already up for
review: https://review.opendev.org/#/c/658414
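
A minimal sketch of the distributed DHCP idea using os-ken (the actual
design will be settled in the spec; this only shows steering DHCP requests
on br-int toward locally generated handling):

    from os_ken.base import app_manager
    from os_ken.controller import ofp_event
    from os_ken.controller.handler import CONFIG_DISPATCHER, set_ev_cls

    class DhcpResponder(app_manager.OSKenApp):
        @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
        def install_dhcp_flow(self, ev):
            dp = ev.msg.datapath
            ofproto = dp.ofproto
            parser = dp.ofproto_parser
            # Match IPv4/UDP DHCP client -> server traffic (68 -> 67).
            match = parser.OFPMatch(eth_type=0x0800, ip_proto=17,
                                    udp_src=68, udp_dst=67)
            # Send it to local controller logic instead of flooding, so a
            # static reply can be generated on the compute node.
            actions = [parser.OFPActionOutput(ofproto.OFPP_CONTROLLER)]
            inst = [parser.OFPInstructionActions(
                ofproto.OFPIT_APPLY_ACTIONS, actions)]
            dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=100,
                                          match=match, instructions=inst))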

- networking-ovn team members pointed out that some of the above features
are already implemented in their Stadium project. This led to a discussion
of why we should duplicate efforts implementing the same features, instead
of exploring the possibility of a convergence between the ML2 / agents
based reference implementation and the ML2 / OVN implementation.

   * This discussion is particularly relevant in the context where the
OpenStack community is rationalizing its size and contributors are scarcer
   * Such a convergence would most likely play out over several development
cycles
   * The team agreed to explore how to achieve this convergence. To move
forward, we will need visibility and certainty that the following is
feasible:

      # Feature parity with what the reference implementation offers today
      # Ability to support all the backends in the current reference
implementation
      # Offer verifiable substantial gains in performance and scalability
compared to the current reference implementation
      # Broaden the community of developers contributing to the ML2 / OVN
implementation

   * To move ahead in the exploration of this convergence, the following
actions were agreed:

      # Benchmarking of the two implementations will be carried out with
EnOS, as part of the performance and scaling up activities described above
      # Write the necessary specs to address feature parity, support all
the backends in the current reference implementation and provide migration
paths
      # An item will be added to the weekly Neutron meeting to track
progress
      # Make every decision along this exploration process with approval of
the broader community


Policy topics / API
==============

- Keystone now has a system scope. A system-scoped token implies the user
has authorization to act on the deployment system. These tokens are useful
for interacting with resources that affect the deployment as a whole, or
expose resources that may otherwise violate project or domain isolation

   * Currently in Neutron, if users have an admin role, they can access all
the resources
   * In order to maintain alignment with the community, Akihiro Motoki will
review the Neutron code and determine how the admin role is used to
interact with deployment resources. He will also monitor how Nova's
adoption of the system scope progresses. A sketch of what a system-scoped
policy default could look like follows below
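
A minimal sketch, using oslo.policy, of what a system-scope-aware default
could look like (the rule name and check string are illustrative, not
Neutron's current defaults):

    from oslo_policy import policy

    SYSTEM_ADMIN = 'role:admin and system_scope:all'

    rules = [
        policy.DocumentedRuleDefault(
            name='create_network:segments',
            check_str=SYSTEM_ADMIN,
            description='Specify segments when creating a network',
            operations=[{'method': 'POST', 'path': '/networks'}],
        ),
    ]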

- During the policy-in-code work, some improvements and clean ups were left
pending, which are Items 2.3, 2.4 and 4 in
https://etherpad.openstack.org/p/neutron-policy-in-code

- The Neutron approach of using new extensions to make any change to the
REST API discoverable has resulted in the proliferation of "shim
extensions" that introduce small changes such as the addition of an
attribute

   * To eliminate this issue, Akihiro Motoki proposed to maintain the
overall extensions approach but micro-version the extensions so that every
feature added does not result in another extension
   * The counter argument from some in the room was: "extensions are messy,
but it's a static mess. Putting in micro versions creates a mess in the
code with lots of conditionals on micro version enablement"
   * It was decided to explore simple versioning of extensions (a sketch of
the idea follows below). The details will be fleshed out in the following
spec: https://review.opendev.org/#/c/656955
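
A minimal sketch of what simple versioning of an extension could mean
(purely illustrative; the actual mechanism will be defined in the spec):

    # Instead of adding a new shim extension for each new attribute, an
    # existing extension advertises a version that clients can discover.
    EXTENSION = {
        'alias': 'example-extension',  # hypothetical alias
        'version': 2,                  # bumped when an attribute is added
    }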


Neutron - Nova cross project planning
=============================

This session was summarized in the following messages to the mailing list:

-
http://lists.openstack.org/pipermail/openstack-discuss/2019-May/005844.html
summarizes the following topics:

   * Optional NUMA affinity for neutron ports
   * Track neutron ports in placement
   * os-vif to be extended to contain new fields to record the connectivity
type and ml2 driver that bound the vif
   * Boot VMs with unaddressed ports

- Leaking resources when ports are deleted out-of-band is summarized in
this thread:
http://lists.openstack.org/pipermail/openstack-discuss/2019-May/005837.html

- Melanie Witt asked if Neutron would support transferring ownership of its
resources. The answer was yes and, as a next step, she is going to send a
message to the mailing list to define the next steps


Code streamlining proposals
======================

- Streamlining IPAM flow. As a result of bulk port creation work done in
Stein by Nate Johnston, it is clear that there are opportunities to improve
the IPAM code. The team brainstormed several possible approaches and the
following proposals were put forward:

   * When allocating blocks of IP addresses where the strategy is 'next
ip', ensure it happens as a single SQL insert (see the sketch after this
list)
   * Create bulk versions of allocate_ip_from_port_and_store etc. so that
bulk can be preserved when pushed down to the IPAM driver, to take
advantage of the previous item
   * Add profiling code to the IPAM call so that we can log the time
duration for IPAM execution, as a PoC
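
A minimal sketch of the single-insert idea with SQLAlchemy (the table and
helper names are illustrative, not Neutron's IPAM code):

    import sqlalchemy as sa

    metadata = sa.MetaData()
    ip_allocations = sa.Table(
        'ipallocations', metadata,
        sa.Column('ip_address', sa.String(64)),
        sa.Column('subnet_id', sa.String(36)),
        sa.Column('port_id', sa.String(36)))

    def bulk_allocate(session, subnet_id, port_ids, first_ip):
        # first_ip is an ipaddress.IPv4Address; one executemany-style
        # INSERT covers the whole 'next ip' block, instead of one INSERT
        # per port.
        rows = [{'ip_address': str(first_ip + i),
                 'subnet_id': subnet_id,
                 'port_id': port_id}
                for i, port_id in enumerate(port_ids)]
        session.execute(ip_allocations.insert(), rows)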

- Streamlining use of remote groups in security groups. Nate Johnston
pointed out that there is a performance hit when using security groups that
are keyed to a remote_group_id, because when a port is added to a remote
group it triggers security group rule updates for all of the members of the
security group. On deployments with 150+ ports, this can take up to 5 mins
to bring up the port

   * After discussion, the proposed next step is for Nate Johnston to
create a PoC for a new approach where a nested security group creates a new
iptables table/ovs flow table (let's call it a subtable) that can be used
as an abstraction for the nested group relationship. Then the IP addresses
of the primary security group will jump to the new table, and the new table
can represent the contents of the remote security group (see the sketch
after this list)

      # In a case where a primary security group has 170 members and lists
itself as a remote security group (indicating members can all talk amongst
themselves), adding an instance to the security group causes 171 updates,
since each member needs the address of the new instance and a record needs
to be created for the new one
      # With the proposed approach there would only be 2 updates: creating
an entry for the new instance to jump to the subtable representing the
remote security group, and adding an entry to the subtable
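
A minimal sketch of the proposed structure, expressed as iptables-style
rules (the chain names and helper are hypothetical):

    def rules_for_remote_group(primary_chain, subtable_chain, member_ips):
        # One jump from the primary group's chain to the shared subtable;
        # per-member entries live only in the subtable, so adding a member
        # touches one chain instead of every member's rule set.
        rules = ['-A %s -j %s' % (primary_chain, subtable_chain)]
        for ip in member_ips:
            rules.append('-A %s -s %s/32 -j ACCEPT'
                         % (subtable_chain, ip))
        return rules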


Train community goals
=================

The two community goals accepted for Train are:

- PDF doc generation for project docs:
https://review.opendev.org/#/c/647712/

   * Akihiro Motoki will track this goal

- IPv6 support and testing goal: https://review.opendev.org/#/c/653545/

   * Good blog entry on overcoming metadata service shortcomings in this
scenario:
https://superuser.openstack.org/articles/deploying-ipv6-only-tenants-with-openstack/


neutron-lib topics
=============

- To help expedite the merging of neutron-lib consumption patches, it was
proposed to the team that neutron-lib-current projects must get their
dependencies for devstack based testing jobs from source, instead of PyPI.

   * For an example of an incident motivating this proposal, please see:
https://bugs.launchpad.net/tricircle/+bug/1816644
   * This refers to inter-project dependencies, for example networking-l2gw
depending on networking-sfc. It does not apply to *-lib projects; those
will still be based on PyPI releases
   * The team agreed to this proposal
   * When creating a new stable branch, the Zuul config would need to be
updated to point to the stable releases of the other projects it depends
on. This may include a periodic job that tests master and stable branches
against PyPI packages
   * Boden Russell will make a list of what jobs need to be updated in
projects that consume neutron-lib (a superset of the Stadium)

- Boden reminded the team that we have a work items list for neutron-lib:
https://etherpad.openstack.org/p/neutron-lib-volunteers-and-punch-list


Requests for enhancement
=====================

- Improve extraroute API

   * The current extraroute API does not allow atomic additions/deletions
of particular routing table entries. In the current API, the routes
attribute of a router (containing all routing table entries) must be
updated at once, leading to race conditions on the client side (see the
sketch after this list)
   * The team debated several alternatives: an API extension that makes
routers' extra routes first level resources, solving the concurrency issue
through a "compare and swap" approach, seeking input from the API working
group, or providing better guidelines for the use of the current API
   * The decision was made to move ahead with a spec proposing extra routes
as first level API resources. That spec can be found here:
https://review.opendev.org/#/c/655680
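
A minimal sketch of the client-side race with the current API, using
openstacksdk (the cloud name and IDs are placeholders):

    import openstack

    conn = openstack.connect(cloud='mycloud')
    router = conn.network.get_router('ROUTER_ID')

    # Read-modify-write of the full routes list: if another client updates
    # the router between the read above and the write below, its route is
    # silently overwritten.
    routes = list(router.routes or [])
    routes.append({'destination': '10.0.10.0/24',
                   'nexthop': '192.0.2.10'})
    conn.network.update_router(router, routes=routes)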

- Decouple placement reporting service plugin from ML2

   * The placement reporter service plugin, as merged in Stein, depends on
ML2. The improvement idea is to decouple it via a driver pattern, as in the
QoS service plugin
   * We currently don't have a use case for this decoupling. As a
consequence, it was decided to postpone it


Various topics
==========

- Migration of stadium projects CI jobs to python 3

   * We have an etherpad recording the work items:
https://etherpad.openstack.org/p/neutron_stadium_python3_status
   * Lajos Katona will take care of networking-odl
   * Miguel Lavalle will talk to Takashi Yamamoto about networking-midonet
   * Nate Johnston will continue working on networking-bagpipe and
neutron-fwaas patches
   * A list of projects beyond the Stadium will be collected as part of the
effort for neutron-lib to start pulling requirements from source

- Removal of deprecated "of_interface" option

   * The option was deprecated in Pike
   * In some cases, deployers might experience a few seconds of data plane
down time when the OVS agent is restarted without the option
   * A message was sent to the ML warning of this possible effect:
http://lists.openstack.org/pipermail/openstack-dev/2018-September/134938.html.
There has been no reaction from the community
   * We will move ahead with the removal of the option. Patch is here:
https://review.opendev.org/#/c/599496


Status and health of some Stadium and non-Stadium projects
==============================================

- Some projects have experienced loss of development team:

   * networking-odl. In this case, Ericsson is interested in continuing to
maintain the project. The key contact is Lajos Katona
   * networking-l2gw is also of interest to Ericsson (Lajos Katona). Over
the past few cycles the project has been maintained by Ricardo Noriega of
Red Hat. Miguel Lavalle will organize a meeting with Lajos and Ricardo to
decide how to move ahead with this project
   * neutron-fwaas. In this case, Miguel Lavalle will send a message to the
mailing list describing the status of the project and requesting that
parties interested in maintaining it come forward