3) Networking issues like mismatched MTU
My MTU (between nodes) is 9000. I believe my problem is the MTU. I modified /etc/kolla/config/neutron.conf and /etc/kolla/config/neutron/ml2_conf.ini.conf, then ran kolla-ansible -i multinode reconfigure (case 1 here: https://docs.openstack.org/newton/networking-guide/config-mtu.html).
I tested everything again: some functions that did not work now work again, but not all. For example, instances get an IP through DHCP but can't ping the router, although on some networks it works. However, before the reboot of the servers, I had never had a problem with the MTU of 9000. I'm going back to a 1500 MTU on Monday, on site.
Thank you Erik!
Franck VEDEL
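As a rough aid for checking the MTU theory, here is a minimal Python sketch of the arithmetic involved. It assumes IPv4 without IP options and, for the overlay figure, VXLAN tenant networks (an assumption: the thread does not say which overlay type is in use). The function names are illustrative only.

```python
# Header sizes in bytes (IPv4, no IP options).
IPV4_HEADER = 20
ICMP_HEADER = 8
# VXLAN over IPv4: outer IPv4 (20) + UDP (8) + VXLAN (8) + inner Ethernet (14).
VXLAN_OVERHEAD = 50


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def vxlan_instance_mtu(physical_mtu: int) -> int:
    """MTU left for instances on a VXLAN network (assumed overlay type)."""
    return physical_mtu - VXLAN_OVERHEAD


for mtu in (1500, 9000):
    print(mtu, max_ping_payload(mtu), vxlan_instance_mtu(mtu))
```

On Linux, a ping between the nodes with fragmentation prohibited and the payload computed for 9000 (ping -M do -s 8972 <other node>) should succeed if every hop really passes jumbo frames. If it fails while a 1472-byte payload works, something on the path is still limited to 1500.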
On 12 Nov 2022, at 15:10, Erik McCormick <emccormick@cirrusseven.com> wrote:
On Sat, Nov 12, 2022 at 3:08 AM Franck VEDEL <franck.vedel@univ-grenoble-alpes.fr> wrote:
Hello!
Output of the command
Cluster status of node rabbit@iut1r-srv-ops01-i01 ...
Basics
Cluster name: rabbit@iut1r-srv-ops01-i01.u-ga.fr

Disk Nodes
rabbit@iut1r-srv-ops01-i01
rabbit@iut1r-srv-ops02-i01

Running Nodes
rabbit@iut1r-srv-ops01-i01
rabbit@iut1r-srv-ops02-i01

Versions
rabbit@iut1r-srv-ops01-i01: RabbitMQ 3.9.20 on Erlang 24.3.4.2
rabbit@iut1r-srv-ops02-i01: RabbitMQ 3.9.20 on Erlang 24.3.4.2

Maintenance status
Node: rabbit@iut1r-srv-ops01-i01, status: not under maintenance
Node: rabbit@iut1r-srv-ops02-i01, status: not under maintenance

Alarms
(none)

Network Partitions
(none)

Listeners
Node: rabbit@iut1r-srv-ops01-i01, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@iut1r-srv-ops01-i01, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@iut1r-srv-ops01-i01, interface: 10.0.5.109, port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@iut1r-srv-ops01-i01, interface: 10.0.5.109, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@iut1r-srv-ops02-i01, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@iut1r-srv-ops02-i01, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@iut1r-srv-ops02-i01, interface: 10.0.5.110, port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@iut1r-srv-ops02-i01, interface: 10.0.5.110, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0

Feature flags
Flag: drop_unroutable_metric, state: enabled
Flag: empty_basic_get_metric, state: enabled
Flag: implicit_default_bindings, state: enabled
Flag: maintenance_mode_status, state: enabled
Flag: quorum_queue, state: enabled
Flag: stream_queue, state: enabled
Flag: user_limits, state: enabled
Flag: virtual_host_metadata, state: enabled
So... nothing looks strange to me.
All containers are healthy now (after deleting and rebuilding rabbitmq).
In addition to DHCP, communication on the network does not work. If I create an instance, it gets no IP address by DHCP. If I give it a static IP, it can't reach the router. If I create another instance with another static IP, the two don't communicate with each other, and they can't ping the router (or routers: I set up 2, one on each of my 2 external networks).
There are some errors in rabbitmq…..log:
2022-11-12 08:53:37.155542+01:00 [error] <0.16179.2> missed heartbeats from client, timeout: 60s
2022-11-12 08:54:54.026480+01:00 [error] <0.17357.2> closing AMQP connection <0.17357.2> (10.0.5.109:37532 -> 10.0.5.109:5672 - mod_wsgi:43:e50d8e69-7c76-4198-877c-c807e0a180d8):
2022-11-12 08:54:54.026480+01:00 [error] <0.17357.2> missed heartbeats from client, timeout: 60s

There are also some errors in neutron-l3-agent.log:
2022-11-11 22:04:42.512 37 ERROR oslo_service.periodic_task message = self.waiters.get(msg_id, timeout=timeout)
2022-11-11 22:04:42.512 37 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 445, in get
2022-11-11 22:04:42.512 37 ERROR oslo_service.periodic_task 'to message ID %s' % msg_id)
2022-11-11 22:04:42.512 37 ERROR oslo_service.periodic_task oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 297cacfadd764562bf09a1c5daf61958

Also in neutron-dhcp-agent.log:
2022-11-11 22:04:44.854 7 ERROR neutron.agent.dhcp.agent message = self.waiters.get(msg_id, timeout=timeout)
2022-11-11 22:04:44.854 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 445, in get
2022-11-11 22:04:44.854 7 ERROR neutron.agent.dhcp.agent 'to message ID %s' % msg_id)
2022-11-11 22:04:44.854 7 ERROR neutron.agent.dhcp.agent oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 6f1d9d0c51ac4d89b9c889ca273f40a0

A lot of errors in neutron-metadata.log:
2022-11-11 22:01:44.152 43 ERROR oslo.messaging._drivers.impl_rabbit [-] [d7902e2c-eba9-40e4-b872-40e7ba7a39ec] AMQP server on 10.0.5.109:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-11-11 22:01:44.226 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [028872e4-fcd1-4de5-b20c-8c5541e3c77f] AMQP server on 10.0.5.109:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
timeout... waiting... unreachable... connection error...
Something is wrong, but I think it's very difficult to find the problem. Too difficult for me. "nc -v" works.
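For repeating the "nc -v" style check from each node, a minimal Python sketch might look like the following. The helper name can_connect is illustrative, not part of any OpenStack or RabbitMQ tool. Like nc, it only shows that the TCP port accepts connections; it says nothing about whether heartbeats or large packets get through afterwards.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


# Example (addresses and port taken from the listener output in this thread):
# for host in ("10.0.5.109", "10.0.5.110"):
#     print(host, can_connect(host, 5672))
```

A successful connect combined with "missed heartbeats" in the RabbitMQ log would be consistent with small packets passing while something else on the path fails, which is why the MTU and time-sync checks are still worth doing.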
There are several things that can cause issues with Rabbit, or with services sending messages. Rabbit itself is not always to blame. Things I've seen cause issues before include:
1) Time not being in sync on all systems (covered that earlier)
2) DNS (it's always DNS, right?)
3) Networking issues like mismatched MTU
4) Nova being configured for a Ceph backend, but timing out trying to talk to the cluster (messages would expire while Nova waited on it)
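For item 2, one quick way to rule DNS in or out is to compare what each node name resolves to against the address you expect. The sketch below is only an illustration: check_names is a made-up helper, and the name-to-address mapping in the example is inferred from the listener output in this thread, so treat it as an assumption to verify.

```python
import socket
from typing import Callable, Dict, List


def check_names(expected: Dict[str, str],
                resolve: Callable[[str], str] = socket.gethostbyname) -> List[str]:
    """Return a list of problems; an empty list means every name matched."""
    problems = []
    for name, want in expected.items():
        try:
            got = resolve(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems.append(f"{name}: lookup failed ({exc})")
            continue
        if got != want:
            problems.append(f"{name}: resolves to {got}, expected {want}")
    return problems


# Example (mapping assumed from the RabbitMQ listener addresses above):
# print(check_names({
#     "iut1r-srv-ops01-i01": "10.0.5.109",
#     "iut1r-srv-ops02-i01": "10.0.5.110",
# }))
```

Running it on every node matters, since a stale /etc/hosts entry on one host is enough to break messaging for the services there.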
I do not know what to do. I can afford to lose all data (networks, instances, volumes, etc.). I can start again with a new config. Do I do that with kolla-ansible -i multinode destroy?
Yeah, just do kolla-ansible -i multinode destroy after backing up your kolla configs.
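For the backup step, a minimal Python sketch is shown below. It assumes the Kolla configs live under /etc/kolla, as the paths mentioned elsewhere in this thread suggest; the destination directory and the helper name backup_dir are hypothetical. The inventory file may live elsewhere and would need copying separately.

```python
import shutil
import time
from pathlib import Path


def backup_dir(src: str, dest_root: str) -> Path:
    """Copy src into a new timestamped directory under dest_root and return it."""
    source = Path(src)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(dest_root) / f"{source.name}-{stamp}"
    # copytree creates missing parent directories and fails if target exists.
    shutil.copytree(source, target, symlinks=True)
    return target


# Example (destination path is hypothetical):
# print(backup_dir("/etc/kolla", "/root/kolla-backups"))
```

The timestamped target means repeated backups never overwrite each other, which is handy when comparing configs before and after a redeploy.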
Before switching to Yoga, I had a cluster running Xena. I kept my configuration and a venv (Python) with kolla-ansible for Xena. Should I go back to that version? How do I do it without doing something stupid?
I can't see any good reason to roll back to Xena. Yoga should be fine.
Changing should be as simple as swapping your VENV, and using your Xena globals.yml, passwords.yml, inventory, and any other custom configs you had for that version.
Thanks a lot.
Franck VEDEL
On 11 Nov 2022, at 23:33, Laurent Dumont <laurentfdumont@gmail.com> wrote:
docker exec -it rabbitmq rabbitmqctl cluster_status