[neutron][OpenStack-ansible] Performance issues with trunk ports
Dmitriy Rabotyagov
noonedeadpunk at gmail.com
Tue Apr 4 13:28:29 UTC 2023
Hi John,
Out of interest, have you tried setting "neutron_use_uwsgi: false" in your
user_variables.yml and re-running the os-neutron-install playbook to see
whether that solves your issue?
You might also need to restart the service manually after that, as we have
a known bug (scheduled to be fixed soon) that skips the service restart if
only the systemd service file is changed. I'm not sure whether the neutron
role is affected, but I thought it was worth mentioning.
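
For reference, a minimal sketch of that override. The file path is the usual
OpenStack-Ansible default, and the commands in the comments are assumptions
about a typical deployment, so adjust them to your environment:

```yaml
# /etc/openstack_deploy/user_variables.yml (assumed default location)
# Run the Neutron API without uWSGI.
neutron_use_uwsgi: false

# Then re-run the neutron playbook, e.g. from the playbooks directory:
#   openstack-ansible os-neutron-install.yml
# If the service was not restarted automatically, restart it by hand on
# each neutron-server host or container (unit name is an assumption):
#   systemctl restart neutron-server
```
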
Tue, 4 Apr 2023, 15:19 Lajos Katona <katonalala at gmail.com>:
> Hi,
> Perfect, please do that.
>
> Lajos
>
> John Bartelme <bartelme at gmail.com> wrote (on Tue, 4 Apr 2023, 15:12):
>
>> When you say trunk issue, do you mean the RPC calls going to uWSGI
>> threads or the general issue with long provisioning times? For the
>> long times I'm not sure I have enough detail to write a bug, but I
>> could for the RPC calls.
>>
>> Also I'm using LinuxBridge on the backend.
>>
>> Thanks, john
>>
>> On 4/4/23, Lajos Katona <katonalala at gmail.com> wrote:
>> > Hi,
>> > could you open a bug report on https://bugs.launchpad.net/neutron/
>> > for the trunk issue, with reproduction steps?
>> > It is also important to know which backend you use: OVS or something
>> > else?
>> >
>> > Thanks in advance
>> > Lajos Katona (lajoskatona)
>> >
>> > John Bartelme <bartelme at gmail.com> wrote (on Tue, 4 Apr 2023, 14:15):
>> >
>> >> Hello,
>> >>
>> >> I'm currently experiencing some pretty severe performance issues with
>> >> my openstack-ansible deployed cluster (yoga) while deploying trunk
>> >> ports, and I'm looking for some help determining what might be the
>> >> cause of this poor performance.
>> >>
>> >> In my simplest case I'm deploying 2 servers, each with one trunk port.
>> >> The first trunk has 2 subports and the second has 6 subports. Both
>> >> servers also have 3 other regular ports. The first trunk port's
>> >> subports are often provisioned quickly, but the second trunk port
>> >> takes anywhere from 30 seconds to 18 minutes. This happens even when
>> >> I isolate neutron-server to a single physical machine with 44 cores
>> >> (88 threads) and 256GB of RAM.
>> >>
>> >> Further diagnosis has shown me some things I didn't quite understand.
>> >> My deployment with OpenStack-Ansible deploys neutron-server with 16
>> >> uWSGI processes and neutron-rpc-server with 16 RPC workers. However,
>> >> the way the trunk RPC server is implemented, on the RPC side it runs
>> >> only on the parent RPC thread, yet it also runs in all of the uWSGI
>> >> processes. This means that most of my trunk RPC calls are being
>> >> handled by uWSGI instead of the RPC workers. When the parent RPC
>> >> thread handles the trunk port creation calls, I consistently see
>> >> creation times of 1-1.5 seconds. I've isolated it so that this thread
>> >> does all of the trunk RPC calls, and this works quite well, but it
>> >> doesn't seem ideal.
>> >>
>> >> What could be causing such poor performance on the uWSGI side of the
>> >> house? I'm having a really hard time getting a good feeling for what
>> >> might be slowing it down so much. I'm wondering if it could be green
>> >> thread preemption, but I really don't know. I've tried setting
>> >> 'enable-threads' to false for uWSGI, but I don't think that is
>> >> improving performance. Putting the profiled decorator on
>> >> update_subport_bindings shows different places taking longer every
>> >> time, but in general a lot of time (tottime, i.e. not subfunction
>> >> time) is spent in webob/dec.py(__call__), paste/urlmap.py(__call__),
>> >> webob/request.py(call_application) and webob/request.py(send). What
>> >> else can I do to find out why this is taking so long?
>> >>
>> >> As a side question, it seems counterintuitive that uWSGI handles most
>> >> of the trunk RPC calls and not the RPC workers?
>> >>
>> >> A couple of other notes about my environment that could be relevant
>> >> to my challenges:
>> >>
>> >> I had to disable rabbitmq heartbeats for neutron, as they were not
>> >> being sent reliably and connections were terminated. I tried with
>> >> heartbeat_in_pthread set to both true and false but still had issues.
>> >> It looks like nova also sometimes experiences this, but not nearly as
>> >> often.
>> >>
>> >> I was overzealous with my vxlan ranges in my first configuration and
>> >> gave it a range of 10,000,000, not realizing that would create that
>> >> many rows in the database. Looking into that, I saw that pymysql in
>> >> my cluster takes 3.5 minutes to retrieve those rows, while the mysql
>> >> CLI takes only 4 seconds. Perhaps that is just the overhead of
>> >> pymysql? I've greatly scaled down the vxlan range now.
>> >>
>> >> I'm provisioning the 2 servers with a heat template that contains
>> >> around 200 custom resources. 198 of the resources are set to
>> >> conditionally not create any OpenStack native resources. Deploying
>> >> this template of mostly no-op resources still takes about 3 minutes.
>> >>
>> >> Horizon works, but almost every page takes a few seconds to load. I'm
>> >> not sure if that is normal or not.
>> >>
>> >> Thanks for any help anyone can provide.
>> >>
>> >> john
>> >>
>> >>
>> >
>>
>