[neutron][OpenStack-ansible] Performance issues with trunk ports

4 Apr 2023

      Hello,

I'm currently experiencing some pretty severe performance issues with my
openstack-ansible deployed cluster(yoga) while deploying trunk ports and
I'm looking for some help determining what might be the cause of this poor
performance.

In my simplest case I'm deploying 2 servers each with one trunk port each.
The first trunk has 2 subports and the second 6 subports. Both servers also
have 3 other regular ports. When deploying the first trunk port its
subports are often provisioned quickly and the second trunk port takes
anywhere from 30 seconds to 18 minutes. This happens even when I isolate
neutron-server to a single physical machine with 44(88 threads) and 256GB
ram. Further diagnosis has shown me some things i didn't quite understand.
My deployment with OpenStack-ansible deploys neutron-server with 16 uWSGI
processes and neutron-rpc-server with 16 rpc workers. However the way that
the trunk RPC server is implemented it is only run on the parent RPC thread
and instead runs in all of the uWSGI processes as well. This means that
most of my trunk RPC calls are being handled by the uWSGI instead of the
RPC workers. When the parent RPC thread handles the trunk port creation
calls I constantly see creation times of 1-1.5 seconds. I've isolated it so
that this thread does all of the trunk RPC calls and this works quite well
but this doesn't seem ideal. What could be causing such poor performance in
the uWSGI side of the house? I'm having a really hard time getting a good
feeling for what might be slowing it down so much. I'm wondering if it
could be green thread preemption but I really don't know. I've tried
setting 'enable-threads' false for uWSGI but I don't think that is
improving performance. Putting the profiled decorator on
update_subport_bindings shows different places taking longer every time,
but in general a lot of time(tottime, i.e. not subfunction time) spent in
webob/dec.py(__call__), paste/urlmap.py(__call__),
webob/request.py(call_application),webob/request.py(send). What else can I
do to try and look for why this is taking so long?

As a side question it seems counterintuitive that the uWSGI handles most of
the trunk RPC calls and not the RPC workers?

A couple other notes about my environment that could indicate my challenges:

I had to disable rabbitmq heartbeats for neutron as they kept not getting
sent reliably and connections were terminated. I tried with
heartbeat_in_pthread both true and false but still had issues. It looks
like nova also sometimes experiences this but not near as often.

I was overzealous with my vxlan ranges in my first configuration and gave
it a range of 10,000,000 not realizing that would create that many rows in
the database. Looking into that I saw that pymysql in my cluster takes 3.5
minutes to retrieve those rows. mysql CLI only takes 4 seconds. Perhaps
that is just the overhead of pymysql? I've greatly scaled down the vxlan
range now.

I'm provisioning the 2 servers with a heat template that contains around
200 custom resources. For 198 of the resources they are set to
conditionally not create any OpenStack native resources. Deploying this
template of mostly no-op resources still takes about 3 minutes.

Horizon works but almost every page load take a few seconds to load. I'm
not sure if that is normal or not.

Thanks for any help anyone can provide.

john

John Bartelme

Lajos Katona

John Bartelme

Lajos Katona

Dmitriy Rabotyagov

John Bartelme

Dmitriy Rabotyagov

Rodolfo Alonso Hernandez

John Bartelme

tags

participants (4)