Hello.
I have checked that my OpenStack is Xena and oslo.messaging is 12.9.4. I confirm that the problem still happens.
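For reference, this is how I checked the library version (a Kolla-Ansible deployment; the container name below is from my setup and may differ):

---snip---
docker exec -it nova_conductor pip show oslo.messaging | grep Version
Version: 12.9.4
---snip---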

Nguyen Huu Khoi


On Mon, Oct 24, 2022 at 7:07 PM Nguyễn Hữu Khôi <nguyenhuukhoinw@gmail.com> wrote:
Thank you for your response. This is exactly what I am facing, but I don't know how to work around it because I deploy with Kolla-Ansible Xena. My current workaround is to point oslo.messaging to the VIP. BTW, I would be very glad if we found out why it happened.
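For illustration, the workaround only changes transport_url in each service's config (nova.conf, cinder.conf, ...); the hostnames and credentials below are placeholders:

---snip---
[DEFAULT]
# before: client-side failover across all three nodes
# transport_url = rabbit://user:pass@node1:5672,user:pass@node2:5672,user:pass@node3:5672//
# after: a single VIP in front of the cluster
transport_url = rabbit://user:pass@rabbit-vip:5672//
---snip---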

Nguyen Huu Khoi


On Mon, Oct 24, 2022 at 6:52 PM ROBERTO BARTZEN ACOSTA <roberto.acosta@luizalabs.com> wrote:
Hey folks,

I believe this problem is related to the maximum timeout in the pool loop, and that it was introduced in this thread [1] by this specific commit [2].


Corey Bryant proposed a workaround in this thread [3], reverting this commit [2] and building an alternative Ubuntu package. But the root cause still needs to be investigated, because the code was originally modified to solve the issue in [1].


Regards,
Roberto



On Mon, Oct 24, 2022 at 8:30 AM Nguyễn Hữu Khôi <nguyenhuukhoinw@gmail.com> wrote:
Hello. Sorry for that.
I just want to note that both Nova and Cinder have this problem.
When digging into the logs of both services, I see:
ERROR oslo.messaging._drivers.impl_rabbit [-] [8634b511-7eee-4e50-8efd-b96d420e9914] AMQP server on [node was down]:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>

and 


2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task Traceback (most recent call last):
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 441, in get
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return self._queues[msg_id].get(block=True, timeout=timeout)
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/eventlet/queue.py", line 322, in get
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return waiter.wait()
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/eventlet/queue.py", line 141, in wait
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return get_hub().switch()
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return self.greenlet.switch()
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task _queue.Empty
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task During handling of the above exception, another exception occurred:
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task Traceback (most recent call last):
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_service/periodic_task.py", line 216, in run_periodic_tasks
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     task(self, context)
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/compute/manager.py", line 9716, in _sync_power_states
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     db_instances = objects.InstanceList.get_by_host(context, self.host,
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_versionedobjects/base.py", line 175, in wrapper
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     result = cls.indirection_api.object_class_action_versions(
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/conductor/rpcapi.py", line 240, in object_class_action_versions
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return cctxt.call(context, 'object_class_action_versions',
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/rpc/client.py", line 189, in call
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     result = self.transport._send(
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/transport.py", line 123, in _send
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return self._driver.send(target, ctxt, message,
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 689, in send
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return self._send(target, ctxt, message, wait_for_reply, timeout,
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 678, in _send
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     result = self._waiter.wait(msg_id, timeout,
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 567, in wait
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     message = self.waiters.get(msg_id, timeout=timeout)
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 443, in get
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     raise oslo_messaging.MessagingTimeout(
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID c8a676a9709242908dcff97046d7976d

*** I use a RabbitMQ cluster with an ha-policy for exchanges and queues. These log messages disappear when I restart the Cinder and Nova services.
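For reference, the ha-policy was set with something like the following (the policy name and pattern are from my setup and may differ):

---snip---
rabbitmqctl set_policy --apply-to all ha-all "^" \
  '{"ha-mode":"all","ha-sync-mode":"automatic"}'
---snip---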



Nguyen Huu Khoi


On Mon, Oct 24, 2022 at 5:42 PM Eugen Block <eblock@nde.ag> wrote:
You don't need to create a new thread with the same issue.
Do the RabbitMQ logs reveal anything? We create a cluster within 
RabbitMQ and the output looks like this:

---snip---
control01:~ # rabbitmqctl cluster_status
Cluster status of node rabbit@control01 ...
Basics

Cluster name: rabbit@rabbitmq-cluster

Disk Nodes

rabbit@control01
rabbit@control02
rabbit@control03

Running Nodes

rabbit@control01
rabbit@control02
rabbit@control03

Versions

rabbit@control01: RabbitMQ 3.8.3 on Erlang 22.2.7
rabbit@control02: RabbitMQ 3.8.3 on Erlang 22.2.7
rabbit@control03: RabbitMQ 3.8.3 on Erlang 22.2.7
---snip---

During failover it's not unexpected that a message gets lost, but it 
should be resent, I believe. How is your OpenStack deployed?
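
If messages are not resent, it may also be worth looking at the 
client-side reconnect and heartbeat options in oslo.messaging; a 
sketch with illustrative values (not recommendations):

---snip---
[oslo_messaging_rabbit]
# detect dead broker connections via AMQP heartbeats
heartbeat_timeout_threshold = 60
# pacing of reconnect attempts after a node failure
kombu_reconnect_delay = 1.0
rabbit_retry_interval = 1
rabbit_retry_backoff = 2
---snip---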


Quoting Nguyễn Hữu Khôi <nguyenhuukhoinw@gmail.com>:

> Hello.
> The 2 remaining nodes are still running; here is my output:
> Basics
>
> Cluster name: rabbit@controller01
>
> Disk Nodes
>
> rabbit@controller01
> rabbit@controller02
> rabbit@controller03
>
> Running Nodes
>
> rabbit@controller01
> rabbit@controller03
>
> Versions
>
> rabbit@controller01: RabbitMQ 3.8.35 on Erlang 23.3.4.18
> rabbit@controller03: RabbitMQ 3.8.35 on Erlang 23.3.4.18
>
> Maintenance status
>
> Node: rabbit@controller01, status: not under maintenance
> Node: rabbit@controller03, status: not under maintenance
>
> Alarms
>
> (none)
>
> Network Partitions
>
> (none)
>
> Listeners
>
> Node: rabbit@controller01, interface: [::], port: 15672, protocol: http,
> purpose: HTTP API
> Node: rabbit@controller01, interface: 183.81.13.227, port: 25672, protocol:
> clustering, purpose: inter-node and CLI tool communication
> Node: rabbit@controller01, interface: 183.81.13.227, port: 5672, protocol:
> amqp, purpose: AMQP 0-9-1 and AMQP 1.0
> Node: rabbit@controller03, interface: [::], port: 15672, protocol: http,
> purpose: HTTP API
> Node: rabbit@controller03, interface: 183.81.13.229, port: 25672, protocol:
> clustering, purpose: inter-node and CLI tool communication
> Node: rabbit@controller03, interface: 183.81.13.229, port: 5672, protocol:
> amqp, purpose: AMQP 0-9-1 and AMQP 1.0
>
> Feature flags
>
> Flag: drop_unroutable_metric, state: enabled
> Flag: empty_basic_get_metric, state: enabled
> Flag: implicit_default_bindings, state: enabled
> Flag: maintenance_mode_status, state: enabled
> Flag: quorum_queue, state: enabled
> Flag: user_limits, state: enabled
> Flag: virtual_host_metadata, state: enabled
>
> I used ha_queues with mode "all",
> but it is no better.
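> For completeness, the matching client-side option in oslo.messaging
> (shown only as an illustration) is:
>
> ---snip---
> [oslo_messaging_rabbit]
> rabbit_ha_queues = true
> ---snip---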
> Nguyen Huu Khoi
>
>
> On Tue, Oct 18, 2022 at 7:19 AM Nguyễn Hữu Khôi <nguyenhuukhoinw@gmail.com>
> wrote:
>
>> Description
>> ===========
>> I set up 3 controllers and 3 compute nodes. My system does not work well
>> when 1 RabbitMQ node in the cluster is down: I cannot launch instances; they
>> get stuck at scheduling.
>>
>> Steps to reproduce
>> ===========
>> OpenStack nodes point to rabbit://node1:5672,node2:5672,node3:5672//
>> * Reboot 1 of the 3 RabbitMQ nodes.
>> * Create instances; they get stuck at scheduling.
>>
>> Workaround
>> ===========
>> Point to the RabbitMQ VIP address. But we cannot share the load with this
>> solution. Please give me some suggestions. Thank you very much.
>> I googled and enabled debug logging, but I still cannot understand
>> why.
>>
>> Nguyen Huu Khoi
>>





