[neutron][OpenStack-ansible] Performance issues with trunk ports

Lajos Katona katonalala at gmail.com
Tue Apr 4 13:14:32 UTC 2023


Hi,
Perfect, please do that.

Lajos

John Bartelme <bartelme at gmail.com> wrote (on Tue, 4 Apr 2023 at
15:12):

> When you say trunk issue, do you mean the RPC calls going to the
> uWSGI threads, or this general issue with long times? For the long
> times I'm not sure I have enough detail to write a bug, but I could for
> the RPC calls.
>
> Also I'm using LinuxBridge on the backend.
>
> Thanks, john
>
> On 4/4/23, Lajos Katona <katonalala at gmail.com> wrote:
> > Hi,
> > could you open a bug report on https://bugs.launchpad.net/neutron/ for
> > the trunk issue with reproduction steps?
> > It is also important to know which backend you use: OVS or something else?
> >
> > Thanks in advance
> > Lajos Katona (lajoskatona)
> >
> > John Bartelme <bartelme at gmail.com> wrote (on Tue, 4 Apr 2023 at
> > 14:15):
> >
> >> Hello,
> >>
> >> I'm currently experiencing some pretty severe performance issues with my
> >> openstack-ansible deployed cluster (Yoga) while deploying trunk ports,
> >> and I'm looking for some help determining the cause of this poor
> >> performance.
> >>
> >> In my simplest case I'm deploying 2 servers, each with one trunk port.
> >> The first trunk has 2 subports and the second 6 subports. Both servers
> >> also have 3 other regular ports. When deploying, the first trunk port's
> >> subports are often provisioned quickly, while the second trunk port takes
> >> anywhere from 30 seconds to 18 minutes. This happens even when I isolate
> >> neutron-server to a single physical machine with 44 cores (88 threads)
> >> and 256 GB of RAM. Further diagnosis has shown me some things I didn't
> >> quite understand.
> >> My deployment with OpenStack-ansible deploys neutron-server with 16
> >> uWSGI processes and neutron-rpc-server with 16 RPC workers. However,
> >> the way the trunk RPC server is implemented, it runs only in the parent
> >> RPC thread and in all of the uWSGI processes, not in the RPC workers.
> >> This means that most of my trunk RPC calls are being handled by uWSGI
> >> instead of the RPC workers. When the parent RPC thread handles the
> >> trunk port creation calls, I consistently see creation times of 1-1.5
> >> seconds. I've isolated it so that this thread handles all of the trunk
> >> RPC calls, and this works quite well, but it doesn't seem ideal. What
> >> could be causing such poor performance on the uWSGI side of the house?
> >> I'm having a really hard time getting a good feeling for what might be
> >> slowing it down so much. I'm wondering if it could be green thread
> >> preemption, but I really don't know. I've tried setting
> >> 'enable-threads' to false for uWSGI, but I don't think that improves
> >> performance. Putting the profiled decorator on update_subport_bindings
> >> shows different places taking longer every time, but in general a lot
> >> of time (tottime, i.e. not subfunction time) is spent in
> >> webob/dec.py(__call__), paste/urlmap.py(__call__),
> >> webob/request.py(call_application), and webob/request.py(send). What
> >> else can I do to find out why this is taking so long?
> >>
> >> As a side question, it seems counterintuitive that uWSGI handles most
> >> of the trunk RPC calls and not the RPC workers.
> >>
> >> A couple of other notes about my environment that could be related to
> >> my challenges:
> >>
> >> I had to disable rabbitmq heartbeats for neutron because they were not
> >> being sent reliably and connections were getting terminated. I tried
> >> with heartbeat_in_pthread both true and false but still had issues. It
> >> looks like nova also sometimes experiences this, but not nearly as
> >> often.
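[Editor's note: for reference, disabling the RabbitMQ heartbeat for a service is done in its oslo.messaging section; a sketch of what this looks like in neutron.conf follows. Option names are per oslo.messaging — verify them against the release actually installed:]

```ini
[oslo_messaging_rabbit]
# 0 disables the AMQP heartbeat entirely
heartbeat_timeout_threshold = 0
# whether the heartbeat runs in a native pthread instead of a green thread
heartbeat_in_pthread = false
```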
> >>
> >> I was overzealous with my vxlan ranges in my first configuration and
> >> gave it a range of 10,000,000, not realizing that would create that
> >> many rows in the database. Looking into that, I saw that pymysql in my
> >> cluster takes 3.5 minutes to retrieve those rows, while the mysql CLI
> >> takes only 4 seconds. Perhaps that is just the overhead of pymysql?
> >> I've greatly scaled down the vxlan range now.
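[Editor's note: the pymysql-vs-CLI gap above is consistent with driver-side overhead: a pure-Python driver materializes one Python object per row, so huge result sets pay a per-row interpreter cost that a C client avoids. The sketch below illustrates the effect with the stdlib sqlite3 module rather than pymysql, since it needs no server; the row count is a scaled-down stand-in for the 10M-row table:]

```python
import sqlite3
import time

# Build an in-memory table of VNI allocation rows, standing in for
# neutron's ml2_vxlan_allocations table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE allocations (vxlan_vni INTEGER, allocated INTEGER)")
conn.executemany(
    "INSERT INTO allocations VALUES (?, 0)",
    ((i,) for i in range(200_000)),  # stand-in for a 10M-row range
)

# fetchall() creates one Python tuple per row; with millions of rows
# this per-row work dominates, which a C client like the mysql CLI skips.
start = time.perf_counter()
rows = conn.execute("SELECT * FROM allocations").fetchall()
elapsed = time.perf_counter() - start
print(f"fetched {len(rows)} rows in {elapsed:.2f}s")
```

With pymysql specifically, an unbuffered server-side cursor (`pymysql.cursors.SSCursor`) at least avoids holding the whole result set in memory at once, though the per-row conversion cost remains.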
> >>
> >> I'm provisioning the 2 servers with a heat template that contains
> >> around 200 custom resources. 198 of the resources are set to
> >> conditionally not create any OpenStack native resources. Deploying
> >> this template of mostly no-op resources still takes about 3 minutes.
> >>
> >> Horizon works, but almost every page load takes a few seconds. I'm not
> >> sure whether that is normal.
> >>
> >> Thanks for any help anyone can provide.
> >>
> >> john
> >>
> >>
> >
>

