OpenStack cluster cannot create instances when 1 of 3 RabbitMQ cluster nodes is down

Nguyễn Hữu Khôi nguyenhuukhoinw at gmail.com
Mon Oct 24 11:23:35 UTC 2022


Hello. Sorry about that.
I just want to note that both Nova and Cinder have this problem.
Digging into the logs of both services, I see:
ERROR oslo.messaging._drivers.impl_rabbit [-] [8634b511-7eee-4e50-8efd-b96d420e9914] AMQP server on [node was down]:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>

and


2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task Traceback (most recent call last):
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 441, in get
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return self._queues[msg_id].get(block=True, timeout=timeout)
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/eventlet/queue.py", line 322, in get
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return waiter.wait()
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/eventlet/queue.py", line 141, in wait
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return get_hub().switch()
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return self.greenlet.switch()
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task _queue.Empty
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task During handling of the above exception, another exception occurred:
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task Traceback (most recent call last):
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_service/periodic_task.py", line 216, in run_periodic_tasks
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     task(self, context)
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/compute/manager.py", line 9716, in _sync_power_states
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     db_instances = objects.InstanceList.get_by_host(context, self.host,
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_versionedobjects/base.py", line 175, in wrapper
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     result = cls.indirection_api.object_class_action_versions(
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/conductor/rpcapi.py", line 240, in object_class_action_versions
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return cctxt.call(context, 'object_class_action_versions',
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/rpc/client.py", line 189, in call
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     result = self.transport._send(
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/transport.py", line 123, in _send
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return self._driver.send(target, ctxt, message,
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 689, in send
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     return self._send(target, ctxt, message, wait_for_reply, timeout,
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 678, in _send
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     result = self._waiter.wait(msg_id, timeout,
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 567, in wait
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     message = self.waiters.get(msg_id, timeout=timeout)
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 443, in get
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task     raise oslo_messaging.MessagingTimeout(
2022-10-24 14:23:01.945 7 ERROR oslo_service.periodic_task oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID c8a676a9709242908dcff97046d7976d
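
In case it matters, these are the [oslo_messaging_rabbit] options in
nova.conf and cinder.conf that, as far as I understand, control this
reconnect/failover behaviour. A minimal sketch with illustrative values
(the defaults, as I understand them), not a recommendation:

---snip---
[oslo_messaging_rabbit]
# 'round-robin' (the default) tries the next broker in the
# transport_url list; 'shuffle' picks one at random.
kombu_failover_strategy = round-robin
# Connection retry pacing: first retry, backoff step, and cap.
rabbit_retry_interval = 1
rabbit_retry_backoff = 2
rabbit_interval_max = 30
# Heartbeats so a dead broker is noticed without waiting on TCP.
heartbeat_timeout_threshold = 60
# Declare queues as mirrored; the exact effect depends on the
# oslo.messaging release, so this alone may not be sufficient.
rabbit_ha_queues = true
---snip---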

*** I use a RabbitMQ cluster with an ha-policy for exchanges and queues.
These errors go away once I restart the Cinder and Nova services.
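
For reference, I set the mirroring policy roughly like this (the policy
name and pattern are illustrative, not my exact values):

---snip---
# mirror matching queues across all nodes, sync new mirrors automatically
rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --apply-to queues

# verify which queues the policy actually matched
rabbitmqctl list_queues name policy
---snip---

One thing I still need to verify is whether the policy actually matches
oslo.messaging's transient reply_* queues: RabbitMQ never mirrors
exclusive queues, and if a reply queue existed only on the rebooted
node, that would explain a timeout like the one above.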



Nguyen Huu Khoi


On Mon, Oct 24, 2022 at 5:42 PM Eugen Block <eblock at nde.ag> wrote:

> You don't need to create a new thread with the same issue.
> Do the rabbitmq logs reveal anything? We created a cluster within
> rabbitmq and the output looks like this:
>
> ---snip---
> control01:~ # rabbitmqctl cluster_status
> Cluster status of node rabbit at control01 ...
> Basics
>
> Cluster name: rabbit at rabbitmq-cluster
>
> Disk Nodes
>
> rabbit at control01
> rabbit at control02
> rabbit at control03
>
> Running Nodes
>
> rabbit at control01
> rabbit at control02
> rabbit at control03
>
> Versions
>
> rabbit at control01: RabbitMQ 3.8.3 on Erlang 22.2.7
> rabbit at control02: RabbitMQ 3.8.3 on Erlang 22.2.7
> rabbit at control03: RabbitMQ 3.8.3 on Erlang 22.2.7
> ---snip---
>
> During failover it's not unexpected that a message gets lost, but it
> should be resent, I believe. How is your openstack deployed?
>
>
> Quoting Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com>:
>
> > Hello.
> > The 2 remaining nodes are still running; here is my output:
> > Basics
> >
> > Cluster name: rabbit at controller01
> >
> > Disk Nodes
> >
> > rabbit at controller01
> > rabbit at controller02
> > rabbit at controller03
> >
> > Running Nodes
> >
> > rabbit at controller01
> > rabbit at controller03
> >
> > Versions
> >
> > rabbit at controller01: RabbitMQ 3.8.35 on Erlang 23.3.4.18
> > rabbit at controller03: RabbitMQ 3.8.35 on Erlang 23.3.4.18
> >
> > Maintenance status
> >
> > Node: rabbit at controller01, status: not under maintenance
> > Node: rabbit at controller03, status: not under maintenance
> >
> > Alarms
> >
> > (none)
> >
> > Network Partitions
> >
> > (none)
> >
> > Listeners
> >
> > Node: rabbit at controller01, interface: [::], port: 15672, protocol: http, purpose: HTTP API
> > Node: rabbit at controller01, interface: 183.81.13.227, port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
> > Node: rabbit at controller01, interface: 183.81.13.227, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
> > Node: rabbit at controller03, interface: [::], port: 15672, protocol: http, purpose: HTTP API
> > Node: rabbit at controller03, interface: 183.81.13.229, port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
> > Node: rabbit at controller03, interface: 183.81.13.229, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
> >
> > Feature flags
> >
> > Flag: drop_unroutable_metric, state: enabled
> > Flag: empty_basic_get_metric, state: enabled
> > Flag: implicit_default_bindings, state: enabled
> > Flag: maintenance_mode_status, state: enabled
> > Flag: quorum_queue, state: enabled
> > Flag: user_limits, state: enabled
> > Flag: virtual_host_metadata, state: enabled
> >
> > I used ha_queues mode "all", but it did not help.
> > Nguyen Huu Khoi
> >
> >
> > On Tue, Oct 18, 2022 at 7:19 AM Nguyễn Hữu Khôi <
> nguyenhuukhoinw at gmail.com>
> > wrote:
> >
> >> Description
> >> ===========
> >> I set up 3 controllers and 3 compute nodes. My system does not work
> >> well when 1 node of the RabbitMQ cluster is down: I cannot launch
> >> instances; they get stuck at scheduling.
> >>
> >> Steps to reproduce
> >> ===========
> >> * OpenStack services point to rabbit://node1:5672,node2:5672,node3:5672//
> >> * Reboot 1 of the 3 RabbitMQ nodes.
> >> * Create instances; they get stuck at scheduling.
> >>
> >> Workaround
> >> ===========
> >> Point the services at the RabbitMQ VIP address instead. But we cannot
> >> share the load across brokers with this solution. Please give me some
> >> suggestions. Thank you very much.
> >> I googled and enabled debug logging, but I still cannot understand why.
> >>
> >> Nguyen Huu Khoi
> >>
>
>
>
>
>