OpenStack cluster cannot create instances when 1 of 3 RabbitMQ cluster nodes is down

ROBERTO BARTZEN ACOSTA roberto.acosta at luizalabs.com
Mon Oct 24 11:52:38 UTC 2022


Hey folks,

I believe this problem is related to the maximum timeout in the poll loop
(sketched below), and was introduced by this specific commit [2], which
came out of this thread [1].

[1] https://bugs.launchpad.net/oslo.messaging/+bug/1935864
[2] https://opendev.org/openstack/oslo.messaging/commit/bdcf915e788bb368774e5462ccc15e6f5b7223d7

In thread [3], Corey Bryant proposed a workaround that reverts this commit
[2] and builds an alternative Ubuntu package, but the root cause still
needs to be investigated, because the commit was originally introduced to
solve issue [1].

[3] https://bugs.launchpad.net/ubuntu/jammy/+source/python-oslo.messaging/+bug/1993149
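
To illustrate the mechanism involved (a simplified sketch of the general
pattern only, not the actual oslo.messaging code): the RPC reply waiter
blocks on an in-memory queue, and capping each individual wait is what
lets the poll loop wake up periodically, e.g. to re-check the connection:

---snip---
# Simplified illustration only -- not the real oslo.messaging code.
import queue
import time

def poll_for_reply(q, total_timeout, max_poll=1.0):
    """Wait for a reply, but never block more than max_poll at a time."""
    deadline = time.monotonic() + total_timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("Timed out waiting for a reply")
        try:
            # Cap each blocking get() so the loop regains control regularly.
            return q.get(block=True, timeout=min(remaining, max_poll))
        except queue.Empty:
            # Woke up without a message: a real driver could re-check the
            # AMQP connection state here before blocking again.
            continue
---snip---

With a cap like this, poll_for_reply(queue.Queue(), total_timeout=3) fails
with TimeoutError after roughly 3 seconds instead of blocking indefinitely.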

Regards,
Roberto



On Mon, Oct 24, 2022 at 08:30 Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com> wrote:

> Hello. Sorry for that.
> I just want to note that both Nova and Cinder have this problem.
> When digging into the logs of both services I see:
> ERROR oslo.messaging._drivers.impl_rabbit [-] [8634b511-7eee-4e50-8efd-b96d420e9914] AMQP server on [node was down]:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
>
> and
>
>
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task Traceback (most recent call last):
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 441, in get
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return self._queues[msg_id].get(block=True, timeout=timeout)
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/eventlet/queue.py", line 322, in get
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return waiter.wait()
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/eventlet/queue.py", line 141, in wait
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return get_hub().switch()
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return self.greenlet.switch()
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task _queue.Empty
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task During handling of the above exception, another exception occurred:
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task Traceback (most recent call last):
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_service/periodic_task.py", line 216, in run_periodic_tasks
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     task(self, context)
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/compute/manager.py", line 9716, in _sync_power_states
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     db_instances = objects.InstanceList.get_by_host(context, self.host,
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_versionedobjects/base.py", line 175, in wrapper
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     result = cls.indirection_api.object_class_action_versions(
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/conductor/rpcapi.py", line 240, in object_class_action_versions
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return cctxt.call(context, 'object_class_action_versions',
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/rpc/client.py", line 189, in call
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     result = self.transport._send(
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/transport.py", line 123, in _send
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return self._driver.send(target, ctxt, message,
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 689, in send
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return self._send(target, ctxt, message, wait_for_reply, timeout,
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 678, in _send
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     result = self._waiter.wait(msg_id, timeout,
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 567, in wait
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     message = self.waiters.get(msg_id, timeout=timeout)
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 443, in get
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     raise oslo_messaging.MessagingTimeout(
> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID c8a676a9709242908dcff97046d7976d
>
> *** I use a RabbitMQ cluster with an HA policy for exchanges and queues.
> These log messages disappear when I restart the Cinder and Nova services.
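>
> For reference, the HA policy was applied with something like this (the
> pattern and vhost here are examples, not my exact values):
>
> ---snip---
> rabbitmqctl set_policy -p / ha-all "^(?!amq\.).*" '{"ha-mode":"all"}'
> ---snip---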
>
>
>
> Nguyen Huu Khoi
>
>
> On Mon, Oct 24, 2022 at 5:42 PM Eugen Block <eblock at nde.ag> wrote:
>
>> You don't need to create a new thread for the same issue.
>> Do the RabbitMQ logs reveal anything? We create the cluster within
>> RabbitMQ itself, and the output looks like this:
>>
>> ---snip---
>> control01:~ # rabbitmqctl cluster_status
>> Cluster status of node rabbit at control01 ...
>> Basics
>>
>> Cluster name: rabbit at rabbitmq-cluster
>>
>> Disk Nodes
>>
>> rabbit at control01
>> rabbit at control02
>> rabbit at control03
>>
>> Running Nodes
>>
>> rabbit at control01
>> rabbit at control02
>> rabbit at control03
>>
>> Versions
>>
>> rabbit at control01: RabbitMQ 3.8.3 on Erlang 22.2.7
>> rabbit at control02: RabbitMQ 3.8.3 on Erlang 22.2.7
>> rabbit at control03: RabbitMQ 3.8.3 on Erlang 22.2.7
>> ---snip---
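>>
>> For reference, the nodes were joined with RabbitMQ's own tooling,
>> roughly like this (run on each additional node; hostnames are from the
>> output above):
>>
>> ---snip---
>> control02:~ # rabbitmqctl stop_app
>> control02:~ # rabbitmqctl join_cluster rabbit@control01
>> control02:~ # rabbitmqctl start_app
>> ---snip---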
>>
>> During failover it's not unexpected that a message gets lost, but it
>> should be resent, I believe. How is your OpenStack deployed?
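>>
>> The retry behaviour on reconnect is tunable via oslo.messaging; a
>> minimal sketch (these are real option names, but the values are only
>> illustrations, not recommendations):
>>
>> ---snip---
>> # nova.conf / cinder.conf
>> [oslo_messaging_rabbit]
>> kombu_reconnect_delay = 1.0
>> rabbit_retry_interval = 1
>> rabbit_retry_backoff = 2
>> heartbeat_timeout_threshold = 60
>> ---snip---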
>>
>>
>> Quoting Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com>:
>>
>> > Hello.
>> > The 2 remaining nodes are still running; here is my output:
>> > Basics
>> >
>> > Cluster name: rabbit at controller01
>> >
>> > Disk Nodes
>> >
>> > rabbit at controller01
>> > rabbit at controller02
>> > rabbit at controller03
>> >
>> > Running Nodes
>> >
>> > rabbit at controller01
>> > rabbit at controller03
>> >
>> > Versions
>> >
>> > rabbit at controller01: RabbitMQ 3.8.35 on Erlang 23.3.4.18
>> > rabbit at controller03: RabbitMQ 3.8.35 on Erlang 23.3.4.18
>> >
>> > Maintenance status
>> >
>> > Node: rabbit at controller01, status: not under maintenance
>> > Node: rabbit at controller03, status: not under maintenance
>> >
>> > Alarms
>> >
>> > (none)
>> >
>> > Network Partitions
>> >
>> > (none)
>> >
>> > Listeners
>> >
>> > Node: rabbit at controller01, interface: [::], port: 15672, protocol: http, purpose: HTTP API
>> > Node: rabbit at controller01, interface: 183.81.13.227, port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
>> > Node: rabbit at controller01, interface: 183.81.13.227, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
>> > Node: rabbit at controller03, interface: [::], port: 15672, protocol: http, purpose: HTTP API
>> > Node: rabbit at controller03, interface: 183.81.13.229, port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
>> > Node: rabbit at controller03, interface: 183.81.13.229, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
>> >
>> > Feature flags
>> >
>> > Flag: drop_unroutable_metric, state: enabled
>> > Flag: empty_basic_get_metric, state: enabled
>> > Flag: implicit_default_bindings, state: enabled
>> > Flag: maintenance_mode_status, state: enabled
>> > Flag: quorum_queue, state: enabled
>> > Flag: user_limits, state: enabled
>> > Flag: virtual_host_metadata, state: enabled
>> >
>> > I used ha_queues mode "all", but it is no better.
>> > Nguyen Huu Khoi
>> >
>> >
>> > On Tue, Oct 18, 2022 at 7:19 AM Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com> wrote:
>> >
>> >> Description
>> >> ===========
>> >> I set up 3 controllers and 3 compute nodes. My system does not work
>> >> well when 1 node of the RabbitMQ cluster is down: I cannot launch
>> >> instances, they get stuck at scheduling.
>> >>
>> >> Steps to reproduce
>> >> ===========
>> >> OpenStack nodes point to rabbit://node1:5672,node2:5672,node3:5672//
>> >> * Reboot 1 of the 3 RabbitMQ nodes.
>> >> * Create an instance; it gets stuck at scheduling.
>> >>
>> >> Workaround
>> >> ===========
>> >> Point to the RabbitMQ VIP address instead (the two transport_url
>> >> forms are sketched below). But we cannot share the load with this
>> >> solution. Please give me some suggestions. Thank you very much.
>> >> I googled and enabled debug logging, but I still cannot understand
>> >> why.
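>> >>
>> >> A sketch of the two transport_url forms (user, password and host
>> >> names are placeholders):
>> >>
>> >> ---snip---
>> >> # multi-host form: the load is shared, but a down node hits this bug
>> >> transport_url = rabbit://openstack:secret@node1:5672,openstack:secret@node2:5672,openstack:secret@node3:5672//
>> >> # VIP form (workaround): all traffic goes through one address
>> >> transport_url = rabbit://openstack:secret@rabbit-vip:5672//
>> >> ---snip---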
>> >>
>> >> Nguyen Huu Khoi
>> >>
