[neutron][OpenStack-ansible] Performance issues with trunk ports
Hello,

I'm currently experiencing some pretty severe performance issues with my openstack-ansible deployed cluster (Yoga) while deploying trunk ports, and I'm looking for help determining what might be the cause of this poor performance.

In my simplest case I'm deploying 2 servers, each with one trunk port. The first trunk has 2 subports and the second has 6. Both servers also have 3 other regular ports. When deploying, the first trunk port's subports are often provisioned quickly, while the second trunk port takes anywhere from 30 seconds to 18 minutes. This happens even when I isolate neutron-server on a single physical machine with 44 cores (88 threads) and 256 GB of RAM.

Further diagnosis has shown me some things I didn't quite understand. My OpenStack-ansible deployment runs neutron-server with 16 uWSGI processes and neutron-rpc-server with 16 RPC workers. However, the way the trunk RPC server is implemented, it does not run in the RPC worker processes; it runs only in the parent RPC process and, in addition, in all of the uWSGI processes. This means that most of my trunk RPC calls are being handled by uWSGI instead of the RPC workers. When the parent RPC process handles the trunk port creation calls I consistently see creation times of 1-1.5 seconds. I've isolated things so that this process handles all of the trunk RPC calls, and this works quite well, but it doesn't seem ideal. What could be causing such poor performance on the uWSGI side of the house? I'm having a really hard time getting a feel for what might be slowing it down so much. I'm wondering if it could be green thread preemption, but I really don't know. I've tried setting 'enable-threads' to false for uWSGI, but I don't think that improves performance. Putting the profiled decorator on update_subport_bindings shows different places taking longer every time, but in general a lot of time (tottime, i.e. not subfunction time) is spent in webob/dec.py (__call__), paste/urlmap.py (__call__), webob/request.py (call_application) and webob/request.py (send). What else can I do to try to find out why this is taking so long?

As a side question, it seems counterintuitive that uWSGI handles most of the trunk RPC calls rather than the RPC workers.

A couple of other notes about my environment that could be related to my challenges:

I had to disable RabbitMQ heartbeats for neutron as they kept not being sent reliably and connections were terminated. I tried heartbeat_in_pthread both true and false but still had issues. It looks like nova also sometimes experiences this, but not nearly as often.

I was overzealous with my VXLAN ranges in my first configuration and gave it a range of 10,000,000, not realizing that would create that many rows in the database. Looking into that, I saw that PyMySQL in my cluster takes 3.5 minutes to retrieve those rows, while the mysql CLI takes only 4 seconds. Perhaps that is just the overhead of PyMySQL? I've greatly scaled down the VXLAN range now.

I'm provisioning the 2 servers with a Heat template that contains around 200 custom resources. 198 of the resources are set to conditionally not create any OpenStack native resources. Deploying this template of mostly no-op resources still takes about 3 minutes.

Horizon works, but almost every page takes a few seconds to load. I'm not sure whether that is normal or not.

Thanks for any help anyone can provide.

john
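P.S. By "profiled decorator" I mean a plain cProfile-style wrapper roughly like the sketch below. This is only an illustrative sketch, not the exact decorator from my environment, and the update_subport_bindings signature shown is just a placeholder:

    import cProfile
    import functools
    import io
    import pstats

    def profiled(func):
        """Profile a single call and print the hottest entries by tottime."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            profiler.enable()
            try:
                return func(*args, **kwargs)
            finally:
                profiler.disable()
                out = io.StringIO()
                stats = pstats.Stats(profiler, stream=out)
                # tottime = time spent in the function itself, excluding subcalls
                stats.sort_stats('tottime').print_stats(20)
                print(out.getvalue())
        return wrapper

    # Hypothetical usage; the real method lives in the trunk RPC handler.
    @profiled
    def update_subport_bindings(context, subports):
        ...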
Hi,

Could you open a bug report on https://bugs.launchpad.net/neutron/ for the trunk issue, with reproduction steps? It is also important to know which backend you use: OVS or something else?

Thanks in advance
Lajos Katona (lajoskatona)
When you say "trunk issue", do you mean the RPC calls going to the uWSGI threads, or the general issue with the long times? For the long times I'm not sure I have enough detail to write a bug, but I could for the RPC calls.

Also, I'm using LinuxBridge on the backend.

Thanks, john
Hi,

Perfect, please do that.

Lajos
Hi John,

Out of interest, have you tried setting "neutron_use_uwsgi: false" in your user_variables.yml and re-running the os-neutron-install playbook to see if that just solves your issue? You might also need to restart the service manually after that, as we have a known bug (scheduled to be fixed soon) that skips the service restart if only the systemd service file is changed. I'm not sure whether the neutron role is affected or not, but I decided to mention that it might be needed.
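For reference, the change would look roughly like this, assuming the usual /etc/openstack_deploy and /opt/openstack-ansible locations of a standard deployment:

    # /etc/openstack_deploy/user_variables.yml
    neutron_use_uwsgi: false

    # then re-run the neutron playbook from the deploy host:
    #   cd /opt/openstack-ansible/playbooks
    #   openstack-ansible os-neutron-install.yml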
Dmitriy wrote:
Have you tried out of interest to set "neutron_use_uwsgi: false" in your user_variables.yml
Thank you for that suggestion. I did think about changing that option, but reading some of the change logs it looked like everything was being migrated over to uWSGI. When I set that option, things are indeed much better. The update_subport_bindings RPC call is still not handled by the RPC worker threads, but the neutron-server parent thread is able to handle the calls, and much more quickly than the uWSGI threads did, i.e. in that 1-2 second timeframe. What are the ramifications of not using uWSGI? Is this an OK configuration for a production deployment? Are there any thoughts as to why the uWSGI threads are having such performance issues? Thanks so much for all of the help.

I'll continue to write up a bug for the RPC workers not handling update_subport_bindings calls and for uWSGI handling them, which may be unexpected.

Thanks, john
Yes, totally, it would be great to sort out why using uWSGI makes such a difference in performance. I was just trying to provide you with a quick fix in the meantime, in case it's needed and affecting your production deployment.

Also, if neutron_use_uwsgi is set to false, neutron-rpc-server should be stopped, as that service is not used at all in the scenario without uWSGI. Though I can imagine OpenStack-Ansible having a bug where we leave neutron-rpc-server running when switching from uWSGI to eventlet, despite it not being needed for that scenario.

In addition to that, I'm not sure about the current state, but as of Zed, uWSGI was known not to work at all with the OVN driver, for instance. It could be fixed now though.
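If it helps make the worker layout concrete: without uWSGI, neutron-server runs the eventlet-based server and forks its own API and RPC workers based on neutron.conf, which is why the separate neutron-rpc-server is redundant in that scenario. A rough illustration, using the standard Neutron option names with the counts only as examples:

    [DEFAULT]
    # API worker processes forked by the eventlet neutron-server
    api_workers = 16
    # RPC worker processes; the parent process also stays around and, as
    # discussed above, ends up serving the trunk plugin's RPC endpoints
    rpc_workers = 16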
Hello John:

About the Trunk plugin and the RPC calls: this is a flaw in the design of the Trunk service plugin. The RPC backend is instantiated and the RPC calls are registered during the Neutron manager initialization, which happens before the API and RPC workers are created. Because of this, the Trunk plugin RPC calls are attended by the main thread only. That is something to be improved, for sure.

About the VXLAN ranges: it is recommended to limit the range that can be used by Neutron. 4 seconds is still a lot of time for a table so simple (two columns, without any external reference), but 3 minutes is certainly not practical. You need to investigate the poor performance of this engine with this table. But as commented, in order to mitigate that poor performance, you can probably reduce the number of VXLAN ranges to 1K or 2K.

I'll check the Launchpad bug reported. Thanks!

Regards.
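For illustration, the VNI range is set in the ML2 plugin configuration (typically ml2_conf.ini on the neutron-server hosts); a hypothetical range like the one below keeps the VXLAN allocation table at a couple of thousand rows instead of millions:

    [ml2_type_vxlan]
    # ~2000 VNIs -> ~2000 pre-created rows in the VXLAN allocation table
    vni_ranges = 1:2000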
Hi Lajos,

I've created https://bugs.launchpad.net/neutron/+bug/2015275. Please let me know if you have any questions or concerns.

Thanks, john
participants (4)
- Dmitriy Rabotyagov
- John Bartelme
- Lajos Katona
- Rodolfo Alonso Hernandez