[neutron] Slow router provisioning during full resync of L3 Agent

Patryk Jakuszew patryk.jakuszew at gmail.com
Wed Sep 8 21:34:46 UTC 2021


Hi Brian,

On Wed, 8 Sept 2021 at 21:10, Brian Haley <haleyb.dev at gmail.com> wrote:

> The first of these, getting info from neutron-server, was initially
> fixed in 2015 and has been enhanced over the years - retrieving routers
> in 'chunks' to reduce the load on neutron-server, since trying to get
> info on 250 routers is a large response to construct.  When this happens
> do you see neutron-server under heavy load?  It might be you need to
> tune the number of RPC workers in this instance to help.
>
> The second has also been slowly improved on the l3-agent side in a
> number of ways, for example, by dynamically increasing worker threads
> when long backlogs occur (not in Rocky).  Other changes like using
> privsep instead of rootwrap have brought the times down slightly as well.
>   There are probably others I'm not thinking of...

In our test environment I did indeed notice higher CPU load on
neutron-server. I will take a look at both of the options you
mentioned - I have recently seen mentions of matching the number of
RPC workers to the CPU count in order to improve inter-service
communication, but I didn't know about the possibility of switching
between rootwrap and privsep.
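
For anyone following along, the knobs I am going to experiment with
are the standard ones in neutron.conf on the neutron-server nodes; the
values below are only placeholders for our hardware, not a
recommendation:

    [DEFAULT]
    # Separate worker processes for the RPC listener that the agents
    # talk to; a common starting point is to scale this with the
    # number of cores on the controller.
    rpc_workers = 8
    # API workers are tuned independently of the RPC workers.
    api_workers = 8

If neutron-server really is CPU-bound during the resync, spreading the
router queries across more RPC worker processes should at least
shorten the first phase of the sync.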

> That said, I've never seen sync times in the 6-8 hour range, I wonder if
> the systems in question are under any CPU or memory pressure?  Are there
> any other failures in the logs that show things timing out, like RPC
> failure/retries?

This indeed happened during a full resync caused by a major outage of
the entire RabbitMQ cluster (an upgrade from 3.6.x to 3.9.x went wrong).

Our control plane runs mostly on VMs, with the exception of the
Neutron services, which run on dedicated physical nodes. During the
upgrade we actually wanted to add more vCPUs to the RabbitMQ machines,
but after noticing the control plane instability we rolled back that
change. I will conduct more tests to see how much load is generated
during the resync.
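
While doing that, I will also grep the agent and server logs for
oslo.messaging timeouts and retries around the resync window;
something along these lines should catch the usual timeout traces
(the log paths differ between distributions):

    grep -E 'MessagingTimeout|Timed out waiting for a reply' \
        /var/log/neutron/l3-agent.log /var/log/neutron/server.log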

> Some other thoughts:
>
> Last year (2020) there were a number of debug messages added to the
> l3-agent that might help pinpoint where time is being spent for each
> router being processed, but that will not be in either of the later
> releases you mentioned.  Maybe if you could install your test
> environment with something much newer it would help resolve or debug the
> issue better?
>
> Using the OVN mechanism driver totally eliminates the l3-agent, but I
> believe you'd need to jump to Victoria (?) in order to use that.
>
> -Brian

If newer releases have much more debug information available, then
that is definitely worth checking out - I tried gathering some initial
information about the duration of certain operations by attaching
py-spy (https://github.com/benfred/py-spy) to neutron-l3-agent, but it
didn't actually show how long particular operations took to complete.
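
For anyone curious, the basic py-spy invocations are along these lines
(the pgrep pattern is only an illustration and may need adjusting to
how the agent is launched):

    # Dump the current Python stacks of all threads in the agent
    py-spy dump --pid $(pgrep -of neutron-l3-agent)

    # Sample the agent for 60 seconds and write a flame graph
    py-spy record -o l3-agent-profile.svg --duration 60 \
        --pid $(pgrep -of neutron-l3-agent)

That gives an overall picture of where the agent is busy, but not the
per-router timings, which is what I was really after.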

As for OVN... I have evaluated it a bit in a private environment
(packstack all-in-one), and while it does have many welcome
improvements like the elimination of separate agent processes, it is
also missing a feature that makes it a no-go for our production
environment - neutron-vpnaas support. We have *lots* of users who
would not be happy if we took away neutron-vpnaas. :/

Thank you very much for all the information - now I have some
additional directions to look at.

--
Best regards,
Patryk Jakuszew

On Wed, 8 Sept 2021 at 21:10, Brian Haley <haleyb.dev at gmail.com> wrote:
>
> Hi Patryk,
>
> Yes, the re-synchronization of the l3-agent can sometimes be time
> consuming.  A number of things have been added over the years to help
> speed this up, some are in Rocky, some are in later releases.
>
> On 9/8/21 8:48 AM, Patryk Jakuszew wrote:
> > Hello,
> >
> > I have a long-standing issue with L3 Agent which I would like to
> > finally solve - *very* slow router provisioning in L3 Agent.
> >
> > We are operating a Rocky-based OpenStack deployment with three
> > bare-metal L3 Agent nodes running in legacy mode. After restarting the
> > L3 node, it takes a really long time for the L3 agent to become fully
> > operational. There are two parts of resync which take much time:
> > getting the list of routers from neutron-server and actually recreating
> > them on the L3 node.
>
> The first of these, getting info from neutron-server, was initially
> fixed in 2015 and has been enhanced over the years - retrieving routers
> in 'chunks' to reduce the load on neutron-server, since trying to get
> info on 250 routers is a large response to construct.  When this happens
> do you see neutron-server under heavy load?  It might be you need to
> tune the number of RPC workers in this instance to help.
>
> The second has also been slowly improved on the l3-agent side in a
> number of ways, for example, by dynamically increasing worker threads
> when long backlogs occur (not in Rocky).  Other changes like using
> privsep instead of rootwrap have brought the times down slightly as well.
>   There are probably others I'm not thinking of...
>
> > While the long running time of router list retrieval is somewhat
> > understandable, the router provisioning process itself proves to be
> > very troublesome in our operations tasks. In our production deployment
> > with around 250 routers, it takes around 2 hours (!) to recreate the
> > router namespaces and have the L3 node fully functional again. Two
> > hours of router re-provisioning is actually an optimistic scenario,
> > this proved to be much longer during the outages we encountered
> > (sometimes the sync took nearly 6-8 hours). This effectively prolongs
> > any maintenance upgrades, configuration changes and OpenStack release
> > upgrades.
> >
> > Another thing is, on that same production environment the first 100
> > routers usually get provisioned fast (around 30 minutes), after that
> > it slows down with each router - this kind of non-deterministic
> > behavior makes it hard to communicate the maintenance finish ETA for
> > our users.
> >
> > We also have a test environment with Stein already installed, where
> > this problem is also present - a full resync of 150 routers, each
> > with only an external gateway port, takes around an hour to complete.
> >
> > Are there any operators here who also encountered that issue? Does
> > anyone have any experience with a similar situation and is willing to
> > share their observations and optimizations?
>
> Yes, I know of other operators that have encountered this issue, and the
> community has tried to address it over the years.  It seems you might
> have some of the fixes, but not all of them, and some tuning of worker
> threads might help.
>
> That said, I've never seen sync times in the 6-8 hour range, I wonder if
> the systems in question are under any CPU or memory pressure?  Are there
> any other failures in the logs that show things timing out, like RPC
> failure/retries?
>
> Some other thoughts:
>
> Last year (2020) there were a number of debug messages added to the
> l3-agent that might help pinpoint where time is being spent for each
> router being processed, but that will not be in either of the later
> releases you mentioned.  Maybe if you could install your test
> environment with something much newer it would help resolve or debug the
> issue better?
>
> Using the OVN mechanism driver totally eliminates the l3-agent, but I
> believe you'd need to jump to Victoria (?) in order to use that.
>
> -Brian


