[neutron] Slow router provisioning during full resync of L3 Agent

Patryk Jakuszew patryk.jakuszew at gmail.com
Wed Sep 8 12:48:50 UTC 2021


Hello,

I have a long-standing issue with the L3 agent that I would like to
finally solve: *very* slow router provisioning.

We are operating a Rocky-based OpenStack deployment with three
bare-metal L3 agent nodes running in legacy mode. After an L3 node is
restarted, it takes a really long time for its L3 agent to become fully
operational. Two parts of the resync take most of the time: getting the
list of routers from neutron-server, and actually recreating them on
the L3 node.

While the long running time of the router list retrieval is somewhat
understandable, the router provisioning process itself proves to be
very troublesome for our operational tasks. In our production
deployment with around 250 routers, it takes around 2 hours (!) to
recreate the router namespaces and have the L3 node fully functional
again. Two hours of router re-provisioning is actually an optimistic
scenario; it took much longer during the outages we encountered
(sometimes the sync took 6-8 hours). This effectively prolongs any
maintenance work, configuration changes and OpenStack release
upgrades.

Another thing is that, in that same production environment, the first
100 routers usually get provisioned fast (in around 30 minutes), and
after that provisioning slows down with each router. This kind of
non-deterministic behavior makes it hard to communicate a maintenance
completion ETA to our users.
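
As a rough back-of-the-envelope check, assuming the 30-minute and
2-hour figures above describe the same resync run, the average
per-router times can be worked out as below. This is only a sketch of
the arithmetic; the function and constant names are illustrative and
not part of Neutron.

```python
def seconds_per_router(routers: int, minutes: float) -> float:
    """Average provisioning time per router, in seconds."""
    return minutes * 60 / routers


# Approximate figures quoted for the production environment.
TOTAL_ROUTERS = 250   # routers hosted by the deployment
TOTAL_MINUTES = 120   # "around 2 hours" for a full resync
FAST_ROUTERS = 100    # routers that get provisioned quickly
FAST_MINUTES = 30     # "around 30 minutes" for those routers

overall = seconds_per_router(TOTAL_ROUTERS, TOTAL_MINUTES)
first_batch = seconds_per_router(FAST_ROUTERS, FAST_MINUTES)
remainder = seconds_per_router(
    TOTAL_ROUTERS - FAST_ROUTERS, TOTAL_MINUTES - FAST_MINUTES
)

print(f"overall:       {overall:.1f} s/router")
print(f"first 100:     {first_batch:.1f} s/router")
print(f"remaining 150: {remainder:.1f} s/router")
```

By this estimate the remaining 150 routers take about twice as long
per router as the first 100 (roughly 36 s versus 18 s on average).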

We also have a test environment with Stein already installed, where
this problem is also present: a full resync of 150 routers, with only
one external gateway port each, takes around an hour to complete.

Are there any operators here who have also encountered this issue? Does
anyone have experience with a similar situation and is willing to
share their observations and optimizations?

--
Regards,
Patryk Jakuszew


