[neutron] Slow router provisioning during full resync of L3 Agent
Hello,

I have a long-standing issue with the L3 agent which I would like to finally solve - *very* slow router provisioning in the L3 agent.

We are operating a Rocky-based OpenStack deployment with three bare-metal L3 agent nodes running in legacy mode. After restarting an L3 node, it takes a really long time for the L3 agent to become fully operational. There are two parts of the resync that take a lot of time: getting the list of routers from neutron-server and actually recreating them on the L3 node.

While the long running time of the router list retrieval is somewhat understandable, the router provisioning process itself proves very troublesome for our operations tasks. In our production deployment with around 250 routers, it takes around 2 hours (!) to recreate the router namespaces and have the L3 node fully functional again. Two hours of router re-provisioning is actually an optimistic scenario; during the outages we encountered it proved to be much longer (sometimes the sync took nearly 6-8 hours). This effectively prolongs any maintenance work, configuration changes and OpenStack release upgrades.

Another thing: on that same production environment the first 100 routers usually get provisioned quickly (around 30 minutes), but after that each router takes progressively longer - this kind of non-deterministic behavior makes it hard to communicate a maintenance-finish ETA to our users.

We also have a test environment with Stein already installed, where this problem is also present - a full resync of 150 routers, each with only one external gateway port, takes around an hour to complete.

Are there any operators here who have also encountered this issue? Does anyone have experience with a similar situation and is willing to share their observations and optimizations?

--
Regards,
Patryk Jakuszew
Hi Patryk,

Yes, the re-synchronization of the l3-agent can sometimes be time-consuming. A number of things have been added over the years to help speed this up; some are in Rocky, some are in later releases.

On 9/8/21 8:48 AM, Patryk Jakuszew wrote:
Hello,
I have a long-standing issue with the L3 agent which I would like to finally solve - *very* slow router provisioning in the L3 agent.
We are operating a Rocky-based OpenStack deployment with three bare-metal L3 agent nodes running in legacy mode. After restarting an L3 node, it takes a really long time for the L3 agent to become fully operational. There are two parts of the resync that take a lot of time: getting the list of routers from neutron-server and actually recreating them on the L3 node.
The first of these, getting info from neutron-server, was initially fixed in 2015 and has been enhanced over the years - retrieving routers in 'chunks' to reduce the load on neutron-server, since trying to get info on 250 routers is a large response to construct. When this happens, do you see neutron-server under heavy load? It might be that you need to tune the number of RPC workers in this instance to help.

The second has also been slowly improved on the l3-agent side in a number of ways, for example, by dynamically increasing worker threads when long backlogs occur (not in Rocky). Other changes, like using privsep instead of rootwrap, have brought the times down slightly as well. There are probably others I'm not thinking of...
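[For reference, the RPC worker knobs live in neutron.conf on the neutron-server hosts. A rough sketch with purely illustrative values - size them against available cores and observed load:

    [DEFAULT]
    # illustrative values only - size against available cores and observed load
    rpc_workers = 8
    # separate workers for agent state reports, so heavy RPC traffic does not
    # delay them (available in Rocky-era releases as far as I know)
    rpc_state_report_workers = 2

After changing these, neutron-server needs a restart; watching its CPU usage during a full agent resync is the easiest way to tell whether the new values help.]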
While the long running time of the router list retrieval is somewhat understandable, the router provisioning process itself proves very troublesome for our operations tasks. In our production deployment with around 250 routers, it takes around 2 hours (!) to recreate the router namespaces and have the L3 node fully functional again. Two hours of router re-provisioning is actually an optimistic scenario; during the outages we encountered it proved to be much longer (sometimes the sync took nearly 6-8 hours). This effectively prolongs any maintenance work, configuration changes and OpenStack release upgrades.
Another thing: on that same production environment the first 100 routers usually get provisioned quickly (around 30 minutes), but after that each router takes progressively longer - this kind of non-deterministic behavior makes it hard to communicate a maintenance-finish ETA to our users.
We also have a test environment with Stein already installed, where this problem is also present - a full resync of 150 routers, each with only one external gateway port, takes around an hour to complete.
Are there any operators here who have also encountered this issue? Does anyone have experience with a similar situation and is willing to share their observations and optimizations?
Yes, I know of other operators that have encountered this issue, and the community has tried to address it over the years. It seems you might have some of the fixes, but not all of them, and some tuning of worker threads might help.

That said, I've never seen sync times in the 6-8 hour range. I wonder if the systems in question are under any CPU or memory pressure? Are there any other failures in the logs that show things timing out, like RPC failures/retries?

Some other thoughts:

Last year (2020) a number of debug messages were added to the l3-agent that might help pinpoint where time is being spent for each router being processed, but those will not be in either of the releases you mentioned. Maybe if you could install your test environment with something much newer it would help resolve or debug the issue better?

Using the OVN mechanism driver totally eliminates the l3-agent, but I believe you'd need to jump to Victoria (?) in order to use that.

-Brian
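[On the RPC-failure question above, a quick first check is to count oslo.messaging timeouts in the logs. The paths below are only the common defaults and will differ per deployment:

    grep -c 'MessagingTimeout' /var/log/neutron/l3-agent.log
    grep -c 'MessagingTimeout' /var/log/neutron/server.log

A steadily growing count during the resync usually points at an overloaded neutron-server or RabbitMQ rather than at the agent itself.]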
Hi Brian,

On Wed, 8 Sept 2021 at 21:10, Brian Haley <haleyb.dev@gmail.com> wrote:
The first of these, getting info from neutron-server, was initially fixed in 2015 and has been enhanced over the years - retrieving routers in 'chunks' to reduce the load on neutron-server, since trying to get info on 250 routers is a large response to construct. When this happens, do you see neutron-server under heavy load? It might be that you need to tune the number of RPC workers in this instance to help.
The second has also been slowly improved on the l3-agent side in a number of ways, for example, by dynamically increasing worker threads when long backlogs occur (not in Rocky). Other changes, like using privsep instead of rootwrap, have brought the times down slightly as well. There are probably others I'm not thinking of...
In our test environment I noticed that indeed there was a higher CPU load on neutron-server. I will take a look at both of the options that you mentioned - recently I've seen some mentions of adjusting RPC workers to CPU count in order to improve inter-service communication, but I didn't know about the possibility of switching between privsep and rootwrap.
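[One clarification on the privsep/rootwrap point: the switch to privsep happened inside the neutron code over several releases, so it is not a config toggle. What is configurable in Rocky is running rootwrap in daemon mode, which avoids paying rootwrap's startup cost for every command the agent spawns during a resync. A sketch of the relevant [agent] section - the paths are the usual defaults, check your packaging:

    [agent]
    root_helper = sudo neutron-rootwrap /etc/neutron/rootwrap.conf
    # keep a persistent rootwrap daemon instead of spawning a new
    # rootwrap process for every ip/iptables command
    root_helper_daemon = sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
]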
That said, I've never seen sync times in the 6-8 hour range. I wonder if the systems in question are under any CPU or memory pressure? Are there any other failures in the logs that show things timing out, like RPC failures/retries?
This indeed happened during a full resync caused by a major outage of the entire RabbitMQ cluster (an upgrade from 3.6.x to 3.9.x went wrong).

Our control plane runs mostly on VMs, with the exception of the Neutron services, which run on dedicated physical nodes. During the upgrade we actually wanted to add more vCPUs to the RabbitMQ machines, but after noticing the control plane instability we rolled back that change. I will conduct more tests to see how much load is generated during the resync.
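[In case it helps with those tests, a crude way to see how loaded the control plane is while a resync is running - nothing neutron-specific is assumed here, just standard tools:

    # snapshot of the busiest neutron processes; repeat while the sync runs
    ps -eo pid,pcpu,pmem,etime,args --sort=-pcpu | grep -E 'neutron-(server|l3-agent)' | grep -v grep
]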
Some other thoughts:
Last year (2020) a number of debug messages were added to the l3-agent that might help pinpoint where time is being spent for each router being processed, but those will not be in either of the releases you mentioned. Maybe if you could install your test environment with something much newer it would help resolve or debug the issue better?
Using the OVN mechanism driver totally eliminates the l3-agent, but I believe you'd need to jump to Victoria (?) in order to use that.
-Brian
If newer releases have much more debug information available, then it is definitely worth checking out - I tried gathering some initial information about the duration of certain operations by attaching py-spy to neutron-l3-agent (https://github.com/benfred/py-spy), but it didn't actually say how long particular operations took to complete.

As for OVN... I have evaluated it a bit on my private environment (packstack all-in-one) and while it does have many welcome improvements, like the elimination of separate agent processes, it also misses a feature that makes it a no-go for our production environment - neutron-vpnaas support. We have *lots* of users who would not be happy if we took away neutron-vpnaas. :/

Thank you very much for all the information - now I have some additional directions to look at.

--
Best regards,
Patryk Jakuszew
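[For reference, a py-spy flame-graph recording over a longer window usually shows where wall-clock time goes better than a single stack dump. A sketch, with the PID and duration as placeholders:

    # sample the running l3-agent for ~5 minutes and write a flame graph
    py-spy record --pid <l3-agent-pid> --duration 300 -o l3-agent-resync.svg
    # or watch the hottest functions live
    py-spy top --pid <l3-agent-pid>
]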
Hi,

On Wednesday, 8 September 2021 23:34:46 CEST Patryk Jakuszew wrote:
As for OVN... I have evaluated it a bit on my private environment (packstack all-in-one) and while it does have many welcome improvements, like the elimination of separate agent processes, it also misses a feature that makes it a no-go for our production environment - neutron-vpnaas support. We have *lots* of users who would not be happy if we took away neutron-vpnaas. :/
Support for vpnaas with the OVN backend is already reported as an RFE: https://bugs.launchpad.net/neutron/+bug/1905391 - unfortunately that work stopped some time ago and there is no progress now. But maybe you would have time and want to help with it - any help is welcome :)
--
Slawek Kaplonski
Principal Software Engineer
Red Hat
participants (3):
- Brian Haley
- Patryk Jakuszew
- Slawek Kaplonski