RabbitMQ performance improvement for single-control node
Hi *,

I would greatly appreciate it if anyone could help identify some tweaks, if that's possible at all.

This is a single-control-node environment with 19 compute nodes which are heavily used. It's still running Pike and we won't be able to upgrade until a complete redeployment. I was able to improve performance a lot by increasing the number of workers etc., but it seems to me the next bottleneck is rabbitmq. The rabbit process shows heavy CPU usage, almost constantly around 150% according to 'top', and I see heartbeat errors in the logs (e.g. cinder). Since there are so many config options I haven't touched them yet. The tuning docs I've read often refer to HA environments, which is not the case here.

I can paste some of the config snippets, maybe that already helps identify something:

# nova.conf
[oslo_messaging_rabbit]
pool_max_size = 1600
pool_max_overflow = 1600
pool_timeout = 120
rpc_response_timeout = 120
rpc_conn_pool_size = 600
Sorry, I accidentally hit "send"... adding more information.

# nova.conf
[oslo_messaging_notifications]
driver = messagingv2

[oslo_messaging_rabbit]
amqp_durable_queues = false
rabbit_ha_queues = false
ssl = false
heartbeat_timeout_threshold = 10

# neutron.conf
[oslo_messaging_rabbit]
pool_max_size = 5000
pool_max_overflow = 2000
pool_timeout = 60

These are the basics in the OpenStack services. I don't have access to the rabbit.conf right now, but I can add more information as soon as I regain access to that machine.

If there's anything else I can provide to improve the user experience at least a bit, please let me know. It would help a lot!

Regards,
Eugen

Zitat von Eugen Block <eblock@nde.ag>:
Do you actually need notifications to be enabled? If not, switch the driver to noop and make sure you purge the notification queues.

On Fri, Aug 27, 2021 at 4:56 AM Eugen Block <eblock@nde.ag> wrote:
--
Mohammed Naser
VEXXHOST, Inc.
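For reference, a minimal sketch of that change, assuming the oslo.messaging default queue names (notifications.info etc.) and the usual config file locations; adjust both for your deployment:

```shell
# In nova.conf, neutron.conf, cinder.conf etc., turn the driver off:
#   [oslo_messaging_notifications]
#   driver = noop
# Restart the affected services, then drain whatever notification
# queues are left over:
rabbitmqctl list_queues name messages | grep '^notifications'
rabbitmqctl purge_queue notifications.info
rabbitmqctl purge_queue notifications.error
# Purge whatever queue names the list_queues output actually shows.
```

Purging matters because with no consumers the queues only ever grow; once the driver is noop they should stop filling up again.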
Yes, notifications are the biggest enemy of RabbitMQ, especially if you don't purge them from time to time. Also increase the stats collection interval of the RabbitMQ monitor, if you are using it.

Sent from my iPhone
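The stats interval lives in the RabbitMQ configuration itself. A sketch, assuming the classic /etc/rabbitmq/rabbitmq.config format used by Pike-era RabbitMQ packages; the values are illustrative, not a recommendation:

```erlang
%% /etc/rabbitmq/rabbitmq.config -- restart rabbitmq-server afterwards.
[
 %% Emit queue/channel statistics every 30 s instead of the 5 s default.
 {rabbit, [{collect_statistics_interval, 30000}]},
 %% Optionally stop computing per-object message rates in the management
 %% plugin, which is often a large share of the rabbit process's CPU time.
 {rabbitmq_management, [{rates_mode, none}]}
].
```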
On Aug 27, 2021, at 10:34 PM, Mohammed Naser <mnaser@vexxhost.com> wrote:
Hi,

thank you both for your helpful responses. There are no consumers for the notifications, so I'm indeed wondering who enabled them, since the default for this deployment method was also "false". Anyway, I will disable notifications and purge the queues, and hopefully see the load drop on the control node.

Thank you very much!
Eugen

Zitat von Satish Patel <satish.txt@gmail.com>:
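A quick way to confirm the cleanup took effect, again assuming the default queue names: after restarting the services with driver = noop, the notification queues should stay empty (or eventually disappear), and the rabbit process's CPU usage in 'top' should drop:

```shell
# Empty output, or queues with 0 messages, means nothing is piling up.
rabbitmqctl list_queues name messages consumers | grep '^notifications'
```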
participants (3)
- Eugen Block
- Mohammed Naser
- Satish Patel