OpenStack cluster cannot create instances when 1 of 3 RabbitMQ cluster nodes is down

Nguyễn Hữu Khôi nguyenhuukhoinw at gmail.com
Mon Oct 24 13:19:01 UTC 2022


Hello.
I have checked that my OpenStack release is Xena and oslo.messaging is 12.9.4. The
problem still happens; I confirm that.

Nguyen Huu Khoi


On Mon, Oct 24, 2022 at 7:07 PM Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com>
wrote:

> Thank you for your response. This is exactly what I am facing, but I don't
> know how I can work around it, because I deploy with Kolla-Ansible Xena. My
> current workaround is to point oslo.messaging to the VIP. BTW, I would be very
> glad if we knew why it happened.
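>
> In case it helps, a minimal sketch of what I mean by pointing oslo.messaging
> to the VIP. This is only an illustration: the override file path is the usual
> Kolla-Ansible custom-config location if I'm not mistaken, and the VIP, user,
> password and vhost are placeholders, not my real values:
>
> ---snip---
> # /etc/kolla/config/global.conf  -- merged into every service's config
> [DEFAULT]
> # single VIP instead of the generated rabbit://node1,node2,node3 list
> transport_url = rabbit://openstack:RABBIT_PASS@10.0.0.100:5672//
> ---snip---
>
> followed by "kolla-ansible -i <inventory> reconfigure" to push it out.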
>
> Nguyen Huu Khoi
>
>
> On Mon, Oct 24, 2022 at 6:52 PM ROBERTO BARTZEN ACOSTA <
> roberto.acosta at luizalabs.com> wrote:
>
>> Hey folks,
>>
>> I believe this problem is related to the maximum timeout in the pool
>> loop, and was introduced by this specific commit [2], which came out of this thread [1].
>>
>> [1] https://bugs.launchpad.net/oslo.messaging/+bug/1935864
>> [2]
>> https://opendev.org/openstack/oslo.messaging/commit/bdcf915e788bb368774e5462ccc15e6f5b7223d7
>>
>> Corey Bryant proposed a workaround that reverts this commit [2] and builds an
>> alternate Ubuntu package in this thread [3] (a short package-hold note follows
>> the link below), but the root cause still needs to be investigated, because the
>> code was originally modified to solve issue [1].
>>
>> [3]
>> https://bugs.launchpad.net/ubuntu/jammy/+source/python-oslo.messaging/+bug/1993149
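>>
>> (A hedged side note, not part of Corey's instructions: if someone does install
>> that alternate package, holding it should stop apt from replacing it on the
>> next upgrade. The package name below is my assumption for Jammy.)
>>
>> ---snip---
>> apt-cache policy python3-oslo.messaging   # confirm which version is installed
>> sudo apt-mark hold python3-oslo.messaging
>> ---snip---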
>>
>> Regards,
>> Roberto
>>
>>
>>
>> On Mon, Oct 24, 2022 at 08:30, Nguyễn Hữu Khôi <
>> nguyenhuukhoinw at gmail.com> wrote:
>>
>>> Hello. Sorry for that.
>>> I just want to note that both nova and cinder have this problem.
>>> When digging into the logs of both services I see:
>>> ERROR oslo.messaging._drivers.impl_rabbit [-]
>>> [8634b511-7eee-4e50-8efd-b96d420e9914] AMQP server on [node was down]:5672
>>> is unreachable: <RecoverableConnectionError: unknown error>. Trying again
>>> in 1 seconds.: amqp.exceptions.RecoverableConnectionError:
>>> <RecoverableConnectionError: unknown error>
>>>
>>> and
>>>
>>>
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task Traceback
>>> (most recent call last):
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File
>>> "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py",
>>> line 441, in get
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return
>>> self._queues[msg_id].get(block=True, timeout=timeout)
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File
>>> "/var/lib/kolla/venv/lib/python3.8/site-packages/eventlet/queue.py", line
>>> 322, in get
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return
>>> waiter.wait()
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File
>>> "/var/lib/kolla/venv/lib/python3.8/site-packages/eventlet/queue.py", line
>>> 141, in wait
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return
>>> get_hub().switch()
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File
>>> "/var/lib/kolla/venv/lib/python3.8/site-packages/eventlet/hubs/hub.py",
>>> line 313, in switch
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return
>>> self.greenlet.switch()
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task _queue.Empty
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task During
>>> handling of the above exception, another exception occurred:
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task Traceback
>>> (most recent call last):
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File
>>> "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_service/periodic_task.py",
>>> line 216, in run_periodic_tasks
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task
>>> task(self, context)
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File
>>> "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/compute/manager.py",
>>> line 9716, in _sync_power_states
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task
>>> db_instances = objects.InstanceList.get_by_host(context, self.host,
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File
>>> "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_versionedobjects/base.py",
>>> line 175, in wrapper
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     result =
>>> cls.indirection_api.object_class_action_versions(
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File
>>> "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/conductor/rpcapi.py",
>>> line 240, in object_class_action_versions
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return
>>> cctxt.call(context, 'object_class_action_versions',
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File
>>> "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/rpc/client.py",
>>> line 189, in call
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     result =
>>> self.transport._send(
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File
>>> "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/transport.py",
>>> line 123, in _send
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return
>>> self._driver.send(target, ctxt, message,
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File
>>> "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py",
>>> line 689, in send
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return
>>> self._send(target, ctxt, message, wait_for_reply, timeout,
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File
>>> "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py",
>>> line 678, in _send
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     result =
>>> self._waiter.wait(msg_id, timeout,
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File
>>> "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py",
>>> line 567, in wait
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     message =
>>> self.waiters.get(msg_id, timeout=timeout)
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File
>>> "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py",
>>> line 443, in get
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     raise
>>> oslo_messaging.MessagingTimeout(
>>> 2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task
>>> oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply
>>> to message ID c8a676a9709242908dcff97046d7976d
>>>
>>> *** I use a RabbitMQ cluster with an ha-policy on exchanges and queues. These
>>> errors go away when I restart the cinder and nova services.
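>>>
>>> For reference, the policy is roughly like the classic-mirroring one sketched
>>> below (the policy name, pattern and sync mode here are only an example, not
>>> necessarily the exact definition on my cluster):
>>>
>>> ---snip---
>>> rabbitmqctl set_policy --apply-to all ha-all "^" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
>>> rabbitmqctl list_policies
>>> ---snip---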
>>>
>>>
>>>
>>> Nguyen Huu Khoi
>>>
>>>
>>> On Mon, Oct 24, 2022 at 5:42 PM Eugen Block <eblock at nde.ag> wrote:
>>>
>>>> You don't need to create a new thread with the same issue.
>>>> Do the rabbitmq logs reveal anything? We also run a rabbitmq cluster
>>>> and the output looks like this:
>>>>
>>>> ---snip---
>>>> control01:~ # rabbitmqctl cluster_status
>>>> Cluster status of node rabbit at control01 ...
>>>> Basics
>>>>
>>>> Cluster name: rabbit at rabbitmq-cluster
>>>>
>>>> Disk Nodes
>>>>
>>>> rabbit at control01
>>>> rabbit at control02
>>>> rabbit at control03
>>>>
>>>> Running Nodes
>>>>
>>>> rabbit at control01
>>>> rabbit at control02
>>>> rabbit at control03
>>>>
>>>> Versions
>>>>
>>>> rabbit at control01: RabbitMQ 3.8.3 on Erlang 22.2.7
>>>> rabbit at control02: RabbitMQ 3.8.3 on Erlang 22.2.7
>>>> rabbit at control03: RabbitMQ 3.8.3 on Erlang 22.2.7
>>>> ---snip---
>>>>
>>>> During failover it's not unexpected that a message gets lost, but it
>>>> should be resent, I believe. How is your OpenStack deployed?
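>>>>
>>>> If the resend doesn't happen, the oslo.messaging reconnect/heartbeat options
>>>> on the services might be worth a look. A rough sketch of the knobs I mean
>>>> (the values are only examples, not recommendations; defaults vary by
>>>> release):
>>>>
>>>> ---snip---
>>>> # e.g. in nova.conf / cinder.conf
>>>> [DEFAULT]
>>>> rpc_response_timeout = 60
>>>>
>>>> [oslo_messaging_rabbit]
>>>> kombu_reconnect_delay = 1.0
>>>> rabbit_retry_interval = 1
>>>> rabbit_retry_backoff = 2
>>>> rabbit_interval_max = 30
>>>> heartbeat_timeout_threshold = 60
>>>> ---snip---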
>>>>
>>>>
>>>> Quoting Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com>:
>>>>
>>>> > Hello.
>>>> > The 2 remaining nodes are still running; here is my output:
>>>> > Basics
>>>> >
>>>> > Cluster name: rabbit at controller01
>>>> >
>>>> > Disk Nodes
>>>> >
>>>> > rabbit at controller01
>>>> > rabbit at controller02
>>>> > rabbit at controller03
>>>> >
>>>> > Running Nodes
>>>> >
>>>> > rabbit at controller01
>>>> > rabbit at controller03
>>>> >
>>>> > Versions
>>>> >
>>>> > rabbit at controller01: RabbitMQ 3.8.35 on Erlang 23.3.4.18
>>>> > rabbit at controller03: RabbitMQ 3.8.35 on Erlang 23.3.4.18
>>>> >
>>>> > Maintenance status
>>>> >
>>>> > Node: rabbit at controller01, status: not under maintenance
>>>> > Node: rabbit at controller03, status: not under maintenance
>>>> >
>>>> > Alarms
>>>> >
>>>> > (none)
>>>> >
>>>> > Network Partitions
>>>> >
>>>> > (none)
>>>> >
>>>> > Listeners
>>>> >
>>>> > Node: rabbit at controller01, interface: [::], port: 15672, protocol: http,
>>>> > purpose: HTTP API
>>>> > Node: rabbit at controller01, interface: 183.81.13.227, port: 25672,
>>>> > protocol: clustering, purpose: inter-node and CLI tool communication
>>>> > Node: rabbit at controller01, interface: 183.81.13.227, port: 5672,
>>>> > protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
>>>> > Node: rabbit at controller03, interface: [::], port: 15672, protocol: http,
>>>> > purpose: HTTP API
>>>> > Node: rabbit at controller03, interface: 183.81.13.229, port: 25672,
>>>> > protocol: clustering, purpose: inter-node and CLI tool communication
>>>> > Node: rabbit at controller03, interface: 183.81.13.229, port: 5672,
>>>> > protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
>>>> >
>>>> > Feature flags
>>>> >
>>>> > Flag: drop_unroutable_metric, state: enabled
>>>> > Flag: empty_basic_get_metric, state: enabled
>>>> > Flag: implicit_default_bindings, state: enabled
>>>> > Flag: maintenance_mode_status, state: enabled
>>>> > Flag: quorum_queue, state: enabled
>>>> > Flag: user_limits, state: enabled
>>>> > Flag: virtual_host_metadata, state: enabled
>>>> >
>>>> > I used ha_queues mode "all",
>>>> > but it did not make things better.
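>>>> >
>>>> > One thing I still plan to check (just a guess on my part) is whether the
>>>> > oslo.messaging reply_ queues are actually mirrored and synchronised, e.g.:
>>>> >
>>>> > ---snip---
>>>> > rabbitmqctl list_queues name policy slave_pids synchronised_slave_pids | grep ^reply_
>>>> > ---snip---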
>>>> > Nguyen Huu Khoi
>>>> >
>>>> >
>>>> > On Tue, Oct 18, 2022 at 7:19 AM Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com>
>>>> > wrote:
>>>> >
>>>> >> Description
>>>> >> ===========
>>>> >> I set up 3 controllers and 3 compute nodes. My system does not work well
>>>> >> when 1 RabbitMQ node in the cluster is down: it cannot launch instances,
>>>> >> and they get stuck at scheduling.
>>>> >>
>>>> >> Steps to reproduce
>>>> >> ===========
>>>> >> OpenStack nodes point to rabbit://node1:5672,node2:5672,node3:5672//
>>>> >> * Reboot 1 of the 3 rabbitmq nodes.
>>>> >> * Create instances; they get stuck at scheduling.
>>>> >>
>>>> >> Workaround
>>>> >> ===========
>>>> >> Point to the RabbitMQ VIP address, but then we cannot share the load with
>>>> >> this solution (a rough sketch of one idea is below). Please give me some
>>>> >> suggestions. Thank you very much.
>>>> >> I googled and enabled debug logging, but I still cannot understand why.
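>>>> >>
>>>> >> The rough idea mentioned above for sharing the load while still using a
>>>> >> single address would be to balance the VIP across the three rabbit nodes,
>>>> >> something like the haproxy fragment below (untested on my side; the VIP,
>>>> >> port and server names are assumptions, not my real config):
>>>> >>
>>>> >> ---snip---
>>>> >> listen rabbitmq_amqp
>>>> >>     bind 10.0.0.100:5672
>>>> >>     mode tcp
>>>> >>     balance roundrobin
>>>> >>     timeout client 3h
>>>> >>     timeout server 3h
>>>> >>     server controller01 controller01:5672 check inter 5s rise 2 fall 3
>>>> >>     server controller02 controller02:5672 check inter 5s rise 2 fall 3
>>>> >>     server controller03 controller03:5672 check inter 5s rise 2 fall 3
>>>> >> ---snip---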
>>>> >>
>>>> >> Nguyen Huu Khoi
>>>> >>
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
>

