[neutron] Slow router provisioning during full resync of L3 Agent

Brian Haley haleyb.dev at gmail.com
Wed Sep 8 19:10:55 UTC 2021


Hi Patryk,

Yes, the re-synchronization of the l3-agent can sometimes be time 
consuming.  A number of things have been added over the years to help 
speed this up; some are in Rocky, others are in later releases.

On 9/8/21 8:48 AM, Patryk Jakuszew wrote:
> Hello,
> 
> I have a long-standing issue with the L3 agent which I would like to
> finally solve: *very* slow router provisioning.
> 
> We are operating a Rocky-based OpenStack deployment with three
> bare-metal L3 Agent nodes running in legacy mode. After restarting the
> L3 node, it takes a really long time for the L3 agent to become fully
> operational. There are two parts of the resync which take a lot of
> time: getting the list of routers from neutron-server and actually
> recreating them on the L3 node.

The first of these, getting info from neutron-server, was initially 
fixed in 2015 and has been enhanced over the years - retrieving routers 
in 'chunks' to reduce the load on neutron-server, since a response 
covering 250 routers is expensive to construct.  When this happens, do 
you see neutron-server under heavy load?  You might need to tune the 
number of RPC workers in this case.
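As a concrete example, on the neutron-server side this is controlled by 
the worker options in neutron.conf.  The values below are purely 
illustrative, not recommendations - tune them to your core count and 
load:

   [DEFAULT]
   # Number of separate API worker processes.
   api_workers = 8
   # Number of RPC worker processes handling agent traffic,
   # including l3-agent sync requests.
   rpc_workers = 8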

The second has also been slowly improved on the l3-agent side in a 
number of ways, for example, by dynamically increasing worker threads 
when long backlogs occur (not in Rocky).  Other changes, like using 
privsep instead of rootwrap, have brought the times down slightly as 
well.  There are probably others I'm not thinking of...
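One Rocky-specific knob worth checking: if the agent is still forking a 
new rootwrap process for every external command, enabling the rootwrap 
daemon can shave time off each router.  A sketch for l3_agent.ini, 
assuming it is not already enabled in your deployment:

   [agent]
   # Run commands through a long-lived rootwrap daemon instead
   # of spawning a new rootwrap process per command.
   root_helper = sudo neutron-rootwrap /etc/neutron/rootwrap.conf
   root_helper_daemon = sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf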

> While the long running time of router list retrieval is somewhat
> understandable, the router provisioning process itself proves to be
> very troublesome for our operations tasks. In our production deployment
> with around 250 routers, it takes around 2 hours (!) to recreate the
> router namespaces and have the L3 node fully functional again. Two
> hours of router re-provisioning is actually an optimistic scenario;
> it proved to be much longer during the outages we encountered
> (sometimes the sync took nearly 6-8 hours). This effectively prolongs
> any maintenance upgrades, configuration changes and OpenStack release
> upgrades.
> 
> Another thing: in that same production environment, the first 100
> routers usually get provisioned quickly (around 30 minutes), but
> after that it slows down with each router - this kind of
> non-deterministic behavior makes it hard to communicate a
> maintenance-completion ETA to our users.
> 
> We also have a test environment with Stein already installed, where
> this problem is also present - a full resync of 150 routers, each
> with only one external gateway port, takes around an hour to
> complete.
> 
> Are there any operators here who have also encountered this issue?
> Does anyone have experience with a similar situation and is willing
> to share their observations and optimizations?

Yes, I know of other operators who have encountered this issue, and the 
community has tried to address it over the years.  It seems you might 
have some of the fixes, but not all of them, and some tuning of worker 
threads might help.

That said, I've never seen sync times in the 6-8 hour range.  I wonder 
if the systems in question are under any CPU or memory pressure?  Are 
there any other failures in the logs that show things timing out, like 
RPC failures/retries?
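As a rough first check, something like this against the agent log (the 
path below is a typical default and may differ in your deployment):

   # Look for oslo.messaging timeouts/retries around the slow periods
   grep -E 'MessagingTimeout|Timed out waiting for a reply' \
       /var/log/neutron/l3-agent.log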

Some other thoughts:

Last year (2020) there were a number of debug messages added to the 
l3-agent that might help pinpoint where time is being spent for each 
router being processed, but those will not be in either of the releases 
you mentioned.  Maybe you could install your test environment with 
something much newer to help debug or resolve the issue?
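If you do stand up a newer test environment, enabling debug logging on 
the agent will make those per-router timing messages visible:

   [DEFAULT]
   # Emit debug-level messages, including the per-router
   # processing logs added in newer releases.
   debug = true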

Using the OVN mechanism driver totally eliminates the l3-agent, but I 
believe you'd need to jump to Victoria (?) in order to use that.
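For reference, the ML2 side of that looks roughly like the following in 
ml2_conf.ini - a sketch only, with made-up database addresses, and of 
course a full OVN migration involves much more than a config change:

   [ml2]
   mechanism_drivers = ovn
   type_drivers = geneve,flat
   tenant_network_types = geneve

   [ovn]
   # Hypothetical endpoints for the OVN northbound/southbound DBs
   ovn_nb_connection = tcp:192.0.2.10:6641
   ovn_sb_connection = tcp:192.0.2.10:6642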

-Brian


