[neutron][OpenStack-ansible] Performance issues with trunk ports
John Bartelme
bartelme at gmail.com
Tue Apr 4 16:06:48 UTC 2023
Hi Lajos,
I've created https://bugs.launchpad.net/neutron/+bug/2015275. Please
let me know if you have any questions or concerns.
Thanks, john
On 4/4/23, Lajos Katona <katonalala at gmail.com> wrote:
> Hi,
> Perfect, please do that.
>
> Lajos
>
> On Tue, Apr 4, 2023 at 15:12, John Bartelme <bartelme at gmail.com> wrote:
>
>> When you say trunk issue, do you mean the RPC calls going to uWSGI
>> threads or the general issue with long provisioning times? For the long
>> times I'm not sure I have enough detail to write a bug, but I could for
>> the RPC calls.
>>
>> Also I'm using LinuxBridge on the backend.
>>
>> Thanks, john
>>
>> On 4/4/23, Lajos Katona <katonalala at gmail.com> wrote:
>> > Hi,
>> > could you open a bug report on https://bugs.launchpad.net/neutron/ for
>> > the trunk issue, with reproduction steps?
>> > It is also important to know which backend you use: OVS or something
>> > else?
>> >
>> > Thanks in advance
>> > Lajos Katona (lajoskatona)
>> >
>> > On Tue, Apr 4, 2023 at 14:15, John Bartelme <bartelme at gmail.com> wrote:
>> >
>> >> Hello,
>> >>
>> >> I'm currently experiencing some pretty severe performance issues with my
>> >> openstack-ansible deployed cluster (Yoga) while deploying trunk ports,
>> >> and I'm looking for some help determining what might be the cause of
>> >> this poor performance.
>> >>
>> >> In my simplest case I'm deploying 2 servers, each with one trunk port.
>> >> The first trunk has 2 subports and the second has 6 subports. Both
>> >> servers also have 3 other regular ports. When deploying, the first trunk
>> >> port's subports are often provisioned quickly, while the second trunk
>> >> port takes anywhere from 30 seconds to 18 minutes. This happens even
>> >> when I isolate neutron-server to a single physical machine with 44 cores
>> >> (88 threads) and 256GB of RAM. Further diagnosis has shown me some
>> >> things I didn't quite understand.
>> >> My deployment with OpenStack-ansible deploys neutron-server with 16
>> >> uWSGI processes and neutron-rpc-server with 16 RPC workers. However, the
>> >> way the trunk RPC server is implemented, it runs only on the parent RPC
>> >> thread, not on the RPC workers, and it also runs in all of the uWSGI
>> >> processes. This means that most of my trunk RPC calls are being handled
>> >> by uWSGI instead of the RPC workers. When the parent RPC thread handles
>> >> the trunk port creation calls I consistently see creation times of 1-1.5
>> >> seconds. I've isolated it so that this thread does all of the trunk RPC
>> >> calls and this works quite well, but this doesn't seem ideal.
>> >>
>> >> What could be causing such poor performance on the uWSGI side of the
>> >> house? I'm having a really hard time getting a good feeling for what
>> >> might be slowing it down so much. I'm wondering if it could be green
>> >> thread preemption, but I really don't know. I've tried setting
>> >> 'enable-threads' to false for uWSGI, but I don't think that is improving
>> >> performance. Putting the profiled decorator on update_subport_bindings
>> >> shows different places taking longer every time, but in general a lot of
>> >> time (tottime, i.e. not subfunction time) is spent in
>> >> webob/dec.py(__call__), paste/urlmap.py(__call__),
>> >> webob/request.py(call_application), and webob/request.py(send). What
>> >> else can I do to try and look for why this is taking so long?
>> >>
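For readers who want to reproduce the kind of measurement described above, here is a minimal sketch of a profiling decorator built on the standard-library cProfile and pstats modules. It is not the decorator used in the message: the names `profiled` and `update_subport_bindings_stub` are hypothetical, and the stub workload only stands in for the real Neutron method. Sorting by `tottime` ranks functions by time spent in their own body, excluding callees. Note that cProfile is not aware of green threads, so under eventlet the attribution may be misleading.

```python
import cProfile
import functools
import io
import pstats


def profiled(sort_key="tottime", limit=10):
    """Hypothetical decorator: profile each call and print the top entries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            try:
                # runcall profiles only this call and returns its result.
                return profiler.runcall(func, *args, **kwargs)
            finally:
                out = io.StringIO()
                stats = pstats.Stats(profiler, stream=out)
                # "tottime" is time in the function itself, without callees.
                stats.sort_stats(sort_key).print_stats(limit)
                wrapper.last_report = out.getvalue()
                print(wrapper.last_report)
        return wrapper
    return decorator


@profiled(sort_key="tottime", limit=5)
def update_subport_bindings_stub(subports):
    # Stand-in workload for illustration; not the Neutron implementation.
    return {port: sum(range(1000)) for port in subports}


result = update_subport_bindings_stub(["p1", "p2"])
```

Comparing several such reports from the same code path is one way to check whether the hot spots really move between calls or whether a few entries dominate consistently.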
>> >> As a side question, isn't it counterintuitive that uWSGI handles most of
>> >> the trunk RPC calls rather than the RPC workers?
>> >>
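A rough way to see why most trunk RPC calls would land on uWSGI is sketched below. It assumes, and this is an assumption rather than something confirmed in the thread, that the 16 uWSGI processes and the single parent RPC process all consume from the same queue and that the broker hands out messages round-robin. The function name `distribute` and the consumer labels are hypothetical.

```python
from collections import Counter
from itertools import cycle


def distribute(calls, consumers):
    """Round-robin `calls` messages across `consumers` on one shared queue."""
    handled = Counter()
    for _, consumer in zip(range(calls), cycle(consumers)):
        handled[consumer] += 1
    return handled


# Assumed consumer set: 16 uWSGI processes plus the one parent RPC process.
consumers = [f"uwsgi-{i}" for i in range(16)] + ["rpc-parent"]
total_calls = 1700
handled = distribute(total_calls, consumers)

uwsgi_calls = sum(n for name, n in handled.items() if name.startswith("uwsgi"))
uwsgi_share = uwsgi_calls / total_calls
print(f"uWSGI share of trunk RPC calls: {uwsgi_share:.1%}")
```

Under these assumptions roughly 16 out of every 17 calls go to a uWSGI process, which would match the observation that the RPC side handles only a small fraction.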
>> >> A couple of other notes about my environment that could point to the
>> >> source of my problems:
>> >>
>> >> I had to disable rabbitmq heartbeats for neutron as they were not being
>> >> sent reliably and connections were terminated. I tried with
>> >> heartbeat_in_pthread both true and false but still had issues. It looks
>> >> like nova also sometimes experiences this, but not nearly as often.
>> >>
>> >> I was overzealous with my vxlan ranges in my first configuration and
>> >> gave it a range of 10,000,000, not realizing that would create that many
>> >> rows in the database. Looking into that, I saw that pymysql in my
>> >> cluster takes 3.5 minutes to retrieve those rows, while the mysql CLI
>> >> only takes 4 seconds. Perhaps that is just the overhead of pymysql? I've
>> >> greatly scaled down the vxlan range now.
>> >>
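To estimate up front how many allocation rows a VXLAN range implies, a small helper like the one below can be used. It is only a sketch: it assumes the ranges are written as comma-separated `min:max` pairs, as in the ML2 `vni_ranges` option, and that one row is created per VNI, as the message describes. The function name `count_vni_rows` is hypothetical.

```python
def count_vni_rows(vni_ranges):
    """Count the VNIs covered by comma-separated inclusive 'min:max' ranges."""
    total = 0
    for item in vni_ranges.split(","):
        low, high = (int(part) for part in item.strip().split(":"))
        if high < low:
            raise ValueError(f"invalid range: {item.strip()}")
        # Both ends are inclusive, so add 1.
        total += high - low + 1
    return total


print(count_vni_rows("1:10000000"))
print(count_vni_rows("1:1000, 2000:2999"))
```

Running this against a planned configuration before deploying gives a quick sanity check on the size of the allocation table.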
>> >> I'm provisioning the 2 servers with a heat template that contains around
>> >> 200 custom resources. 198 of the resources are conditionally set not to
>> >> create any OpenStack native resources. Deploying this template of mostly
>> >> no-op resources still takes about 3 minutes.
>> >>
>> >> Horizon works, but almost every page takes a few seconds to load. I'm
>> >> not sure if that is normal or not.
>> >>
>> >> Thanks for any help anyone can provide.
>> >>
>> >> john
>> >>
>> >>
>> >
>>
>
More information about the openstack-discuss mailing list