[neutron] Slow router provisioning during full resync of L3 Agent

Slawek Kaplonski skaplons at redhat.com
Thu Sep 9 07:19:14 UTC 2021


Hi,

On Wednesday, 8 September 2021 at 23:34:46 CEST, Patryk Jakuszew wrote:
> Hi Brian,
> 
> On Wed, 8 Sept 2021 at 21:10, Brian Haley <haleyb.dev at gmail.com> wrote:
> > The first of these, getting info from neutron-server, was initially
> > fixed in 2015 and has been enhanced over the years - retrieving routers
> > in 'chunks' to reduce the load on neutron-server, since trying to get
> > info on 250 routers is a large response to construct.  When this happens,
> > do you see neutron-server under heavy load?  It might be that you need to
> > tune the number of RPC workers in this instance to help.
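> > 
> > For example (the values here are illustrative, not a recommendation),
> > something like this in neutron.conf on the server side:
> > 
> >     [DEFAULT]
> >     # scale RPC workers with the cores available to neutron-server;
> >     # the default can be quite low on older releases like Rocky
> >     rpc_workers = 8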
> > 
> > The second has also been slowly improved on the l3-agent side in a
> > number of ways, for example, by dynamically increasing worker threads
> > when long backlogs occur (not in Rocky).  Other changes, like using
> > privsep instead of rootwrap, have brought the times down slightly as well.
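> > 
> > A related knob, if an agent is still going through rootwrap for most
> > commands: rootwrap's daemon mode avoids a fork/exec per call. A sketch
> > for l3_agent.ini:
> > 
> >     [agent]
> >     root_helper = sudo neutron-rootwrap /etc/neutron/rootwrap.conf
> >     root_helper_daemon = sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf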
> > 
> >   There are probably others I'm not thinking of...
> 
> In our test environment I noticed that indeed there was a higher CPU
> load on neutron-server. I will take a look at both of the options that
> you mentioned - recently I've seen some mentions of adjusting RPC
> workers to CPU count in order to improve inter-service communication,
> but I didn't know about the possibility of switching between privsep
> and rootwrap.
> 
> > That said, I've never seen sync times in the 6-8 hour range; I wonder if
> > the systems in question are under any CPU or memory pressure?  Are there
> > any other failures in the logs that show things timing out, like RPC
> > failure/retries?
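> > 
> > A quick way to check for that (the log path varies by distro) is
> > something like:
> > 
> >     # count messaging timeouts / broker connectivity errors
> >     grep -cE 'MessagingTimeout|AMQP server' /var/log/neutron/l3-agent.log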
> 
> This indeed happened during a full resync caused by a major outage of
> the entire RabbitMQ cluster (an upgrade from 3.6.x to 3.9.x went wrong).
> 
> Our control plane runs mostly on VMs, with the exception of the Neutron
> services, which run on dedicated physical nodes. During the upgrade we
> actually wanted to add more vCPUs to RabbitMQ machines, but after
> noticing the control plane instability we rolled back that change. I
> will conduct more tests to see how much load is generated during the
> resync.
> 
> > Some other thoughts:
> > 
> > Last year (2020) there were a number of debug messages added to the
> > l3-agent that might help pinpoint where time is being spent for each
> > router being processed, but those will not be in either of the
> > releases you mentioned.  Maybe if you could install your test
> > environment with something much newer, it would help debug or resolve
> > the issue?
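> > 
> > If you do, turning debug logging on in l3_agent.ini is enough to see
> > those per-router timings in the agent log (the exact message text
> > varies by release):
> > 
> >     [DEFAULT]
> >     debug = True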
> > 
> > Using the OVN mechanism driver totally eliminates the l3-agent, but I
> > believe you'd need to jump to Victoria (?) in order to use that.
> > 
> > -Brian
> 
> If newer releases have much more debug information available, then it
> is definitely worth checking out - I tried gathering some initial
> information about the duration of certain operations by attaching
> py-spy to neutron-l3-agent (https://github.com/benfred/py-spy), but it
> didn't actually show how long particular operations took to complete.
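> 
> I might try its 'record' mode next - it samples the running process and
> attributes wall-clock time to call stacks, roughly:
> 
>     # needs root to attach; writes a flame graph after 60 seconds
>     py-spy record --pid $(pgrep -f neutron-l3-agent) \
>         --duration 60 --output l3-agent-profile.svg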
> 
> As for OVN... I have evaluated it a bit on my private environment
> (packstack all-in-one) and while it does have many welcome
> improvements, like the elimination of separate agent processes, it also
> lacks one feature that makes it a no-go for our production environment:
> neutron-vpnaas support. We have *lots* of users who would not be
> happy if we took away neutron-vpnaas. :/

Support for vpnaas with the OVN backend is already reported as an RFE:
https://bugs.launchpad.net/neutron/+bug/1905391 - unfortunately that work
stopped some time ago and there is no progress now. But maybe you would have
time and want to help with it - any help is welcome :)

> 
> Thank you very much for all the information - now I have some
> additional directions to look at.
> 
> --
> Best regards,
> Patryk Jakuszew
> 
> On Wed, 8 Sept 2021 at 21:10, Brian Haley <haleyb.dev at gmail.com> wrote:
> > Hi Patryk,
> > 
> > Yes, the re-synchronization of the l3-agent can sometimes be
> > time-consuming.  A number of things have been added over the years to
> > help speed this up; some are in Rocky, some in later releases.
> > 
> > On 9/8/21 8:48 AM, Patryk Jakuszew wrote:
> > > Hello,
> > > 
> > > I have a long-standing issue with the L3 agent which I would like to
> > > finally solve: *very* slow router provisioning.
> > > 
> > > We are operating a Rocky-based OpenStack deployment with three
> > > bare-metal L3 Agent nodes running in legacy mode. After restarting the
> > > L3 node, it takes a really long time for the L3 agent to become fully
> > > operational. There are two parts of the resync that take a lot of
> > > time: getting the list of routers from neutron-server, and actually
> > > recreating them on the L3 node.
> > 
> > The first of these, getting info from neutron-server, was initially
> > fixed in 2015 and has been enhanced over the years - retrieving routers
> > in 'chunks' to reduce the load on neutron-server, since trying to get
> > info on 250 routers is a large response to construct.  When this happens,
> > do you see neutron-server under heavy load?  It might be that you need to
> > tune the number of RPC workers in this instance to help.
> > 
> > The second has also been slowly improved on the l3-agent side in a
> > number of ways, for example, by dynamically increasing worker threads
> > when long backlogs occur (not in Rocky).  Other changes, like using
> > privsep instead of rootwrap, have brought the times down slightly as well.
> > 
> >   There are probably others I'm not thinking of...
> >   
> > > While the long running time of router list retrieval is somewhat
> > > understandable, the router provisioning process itself proves to be
> > > very troublesome in our operational work. In our production deployment
> > > with around 250 routers, it takes around 2 hours (!) to recreate the
> > > router namespaces and have the L3 node fully functional again. Two
> > > hours of router re-provisioning is actually the optimistic scenario;
> > > it proved to be much longer during the outages we encountered
> > > (sometimes the sync took nearly 6-8 hours). This effectively prolongs
> > > any maintenance work, configuration changes and OpenStack release
> > > upgrades.
> > > 
> > > Another thing: on that same production environment, the first 100
> > > routers usually get provisioned quickly (in around 30 minutes), but
> > > after that it slows down with each router - this kind of
> > > non-deterministic behavior makes it hard to communicate a maintenance
> > > finish ETA to our users.
> > > 
> > > We also have a test environment with Stein already installed, where
> > > this problem is also present - a full resync of 150 routers, each
> > > with only one external gateway port, takes around an hour to complete.
> > > 
> > > Are there any operators here who have also encountered this issue?
> > > Does anyone have experience with a similar situation and is willing
> > > to share their observations and optimizations?
> > 
> > Yes, I know of other operators that have encountered this issue, and the
> > community has tried to address it over the years.  It seems you might
> > have some of the fixes, but not all of them, and some tuning of worker
> > threads might help.
> > 
> > That said, I've never seen sync times in the 6-8 hour range; I wonder if
> > the systems in question are under any CPU or memory pressure?  Are there
> > any other failures in the logs that show things timing out, like RPC
> > failure/retries?
> > 
> > Some other thoughts:
> > 
> > Last year (2020) there were a number of debug messages added to the
> > l3-agent that might help pinpoint where time is being spent for each
> > router being processed, but those will not be in either of the
> > releases you mentioned.  Maybe if you could install your test
> > environment with something much newer, it would help debug or resolve
> > the issue?
> > 
> > Using the OVN mechanism driver totally eliminates the l3-agent, but I
> > believe you'd need to jump to Victoria (?) in order to use that.
> > 
> > -Brian


-- 
Slawek Kaplonski
Principal Software Engineer
Red Hat