3) Networking issues like mismatched MTU

My MTU (between nodes) is 9000…

I believe my problem is the MTU.

I modified /etc/kolla/config/neutron.conf and /etc/kolla/config/neutron/ml2_conf.ini,
then ran kolla-ansible -i multinode reconfigure.

(case 1 here: https://docs.openstack.org/newton/networking-guide/config-mtu.html)
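For reference, the "case 1" (jumbo frames) settings in that guide come down to overrides along these lines — a sketch of the two kolla override files mentioned above, assuming the physical network really carries 9000-byte frames end to end:

```ini
# /etc/kolla/config/neutron.conf
[DEFAULT]
# Largest MTU of the underlying physical network(s)
global_physnet_mtu = 9000

# /etc/kolla/config/neutron/ml2_conf.ini
[ml2]
# Cap on the MTU that tenant/overlay networks may use
path_mtu = 9000
```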

I tested everything again; some functions that did not work before are working again, but not all…

For example, instances get an IP through DHCP but can't ping the router, though on some networks it works.
However, before the servers were rebooted, I had no problem with the 9000 MTU.

I'm going back to a 1500 MTU on Monday on site.
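Before reverting, it may be worth verifying that 9000-byte frames still pass end to end after the reboot (a reboot can silently reset an interface to the default 1500). A minimal check with ping and Don't-Fragment set, using one of the node IPs from the cluster output below as an example peer:

```shell
# Max ICMP payload that fits a 9000-byte MTU:
# 9000 - 20 (IPv4 header) - 8 (ICMP header) = 8972
MTU=9000
PAYLOAD=$((MTU - 20 - 8))
echo "$PAYLOAD"   # 8972
# Printed rather than run here; with -M do, an exact-size probe must
# succeed and anything larger must fail with "message too long":
echo "ping -M do -c 3 -s $PAYLOAD 10.0.5.110"
```

If the exact-size probe fails between any pair of nodes, at least one hop lost its jumbo-frame setting in the reboot.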

Thank you Erik!!!

Franck VEDEL




On 12 Nov 2022, at 15:10, Erik McCormick <emccormick@cirrusseven.com> wrote:



On Sat, Nov 12, 2022 at 3:08 AM Franck VEDEL <franck.vedel@univ-grenoble-alpes.fr> wrote:
Hello!

Output of the command

Cluster status of node rabbit@iut1r-srv-ops01-i01 ...
Basics

Disk Nodes
rabbit@iut1r-srv-ops01-i01
rabbit@iut1r-srv-ops02-i01

Running Nodes
rabbit@iut1r-srv-ops01-i01
rabbit@iut1r-srv-ops02-i01

Versions
rabbit@iut1r-srv-ops01-i01: RabbitMQ 3.9.20 on Erlang 24.3.4.2
rabbit@iut1r-srv-ops02-i01: RabbitMQ 3.9.20 on Erlang 24.3.4.2

Maintenance status
Node: rabbit@iut1r-srv-ops01-i01, status: not under maintenance
Node: rabbit@iut1r-srv-ops02-i01, status: not under maintenance

Alarms
(none)

Network Partitions
(none)

Listeners
Node: rabbit@iut1r-srv-ops01-i01, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@iut1r-srv-ops01-i01, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@iut1r-srv-ops01-i01, interface: 10.0.5.109, port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@iut1r-srv-ops01-i01, interface: 10.0.5.109, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@iut1r-srv-ops02-i01, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@iut1r-srv-ops02-i01, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@iut1r-srv-ops02-i01, interface: 10.0.5.110, port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@iut1r-srv-ops02-i01, interface: 10.0.5.110, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0

Feature flags
Flag: drop_unroutable_metric, state: enabled
Flag: empty_basic_get_metric, state: enabled
Flag: implicit_default_bindings, state: enabled
Flag: maintenance_mode_status, state: enabled
Flag: quorum_queue, state: enabled
Flag: stream_queue, state: enabled
Flag: user_limits, state: enabled
Flag: virtual_host_metadata, state: enabled

So… nothing looks strange to me.

All containers are healthy now (after deleting and rebuilding rabbitmq).


Besides DHCP, communications on the network do not work.
If I create an instance, it gets no IP address via DHCP.
If I give it a static IP, it can't reach the router.
If I create another instance with another static IP, the two don't communicate with each other.
And they can't ping the router (or routers; I put two, one on each of my two external networks).

There are some errors in rabbitmq…..log:
2022-11-12 08:53:37.155542+01:00 [error] <0.16179.2> missed heartbeats from client, timeout: 60s
2022-11-12 08:54:54.026480+01:00 [error] <0.17357.2> closing AMQP connection <0.17357.2> (10.0.5.109:37532 -> 10.0.5.109:5672 - mod_wsgi:43:e50d8e69-7c76-4198-877c-c807e0a180d8):
2022-11-12 08:54:54.026480+01:00 [error] <0.17357.2> missed heartbeats from client, timeout: 60s
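The 60-second figure in those messages matches oslo.messaging's default heartbeat timeout, so it is the clients (mod_wsgi, the agents) that are going quiet, not RabbitMQ. If needed for diagnosis, that threshold can be raised for all services at once — a sketch, assuming your kolla-ansible release merges a global override file:

```ini
# /etc/kolla/config/global.conf
[oslo_messaging_rabbit]
# Default is 60s. Raising it only masks missed heartbeats rather than
# fixing their cause, so treat it as a diagnostic knob, not a cure.
heartbeat_timeout_threshold = 120
```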

There are some errors also in neutron-l3-agent.log
2022-11-11 22:04:42.512 37 ERROR oslo_service.periodic_task     message = self.waiters.get(msg_id, timeout=timeout)
2022-11-11 22:04:42.512 37 ERROR oslo_service.periodic_task   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 445, in get
2022-11-11 22:04:42.512 37 ERROR oslo_service.periodic_task     'to message ID %s' % msg_id)
2022-11-11 22:04:42.512 37 ERROR oslo_service.periodic_task oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 297cacfadd764562bf09a1c5daf61958

Also in neutron-dhcp-agent.log
2022-11-11 22:04:44.854 7 ERROR neutron.agent.dhcp.agent     message = self.waiters.get(msg_id, timeout=timeout)
2022-11-11 22:04:44.854 7 ERROR neutron.agent.dhcp.agent   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 445, in get
2022-11-11 22:04:44.854 7 ERROR neutron.agent.dhcp.agent     'to message ID %s' % msg_id)
2022-11-11 22:04:44.854 7 ERROR neutron.agent.dhcp.agent oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 6f1d9d0c51ac4d89b9c889ca273f40a0

A lot of errors in neutron-metadata.log
2022-11-11 22:01:44.152 43 ERROR oslo.messaging._drivers.impl_rabbit [-] [d7902e2c-eba9-40e4-b872-40e7ba7a39ec] AMQP server on 10.0.5.109:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-11-11 22:01:44.226 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [028872e4-fcd1-4de5-b20c-8c5541e3c77f] AMQP server on 10.0.5.109:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>


timeout …. waiting…. unreachable…. connectionerror…. 

Something is wrong, but I think it's very difficult to find the problem. Too difficult for me.
« nc -v » works.

There are several things that can cause issues with Rabbit, or with services sending messages. Rabbit itself is not always to blame. Things I've seen cause issues before include:

1) Time not being in sync on all systems (covered that earlier)
2) DNS (it's always DNS, right?)
3) Networking issues like mismatched MTU
4) Nova being configured for a Ceph backend, but timing out trying to talk to the cluster (messages would expire while Nova waited on it)
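The first three of those can be checked quickly from each node — a sketch, assuming Linux hosts with chrony and iproute2, with illustrative values standing in for what the commands would return:

```shell
# 1) time sync: run `chronyc tracking` (or `timedatectl status`) on every node.
# 2) DNS: `getent hosts iut1r-srv-ops01-i01` from every node, for every node name.
# 3) MTU: collect `ip -br link show` per node and compare the interconnect MTUs.
#    Illustrative values below stand in for what those commands would report:
mtu_node1=9000
mtu_node2=1500   # a post-reboot reset to the 1500 default would explain the symptoms
if [ "$mtu_node1" -ne "$mtu_node2" ]; then
  echo "MTU mismatch: $mtu_node1 vs $mtu_node2"
fi
```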

 
I do not know what to do.
I can afford to lose all data (networks, instances, volumes, etc.) and start again with a new config.
Do I do it with kolla-ansible -i multinode destroy?

Yeah, just do kolla-ansible -i multinode destroy after backing up your kolla configs.
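For what it's worth, the kolla-ansible CLI of this era requires an explicit confirmation flag on destroy. The full command is printed here rather than executed, since it removes all kolla containers, images, and volumes on the hosts:

```shell
# Destructive: wipes the whole kolla deployment, hence the confirmation flag.
CMD="kolla-ansible -i multinode destroy --yes-i-really-really-mean-it"
echo "$CMD"
```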

Before switching to Yoga, I had a cluster under Xena. I kept my configuration and a (Python) venv with kolla-ansible for Xena.
Should I go back to that version? And how, without doing anything stupid?

I can't see any good reason to roll back to Xena. Yoga should be fine. 

Changing should be as simple as swapping your VENV, and using your Xena globals.yml, passwords.yml, inventory, and any other custom configs you had for that version.  


Thanks a lot.

Franck VEDEL



On 11 Nov 2022, at 23:33, Laurent Dumont <laurentfdumont@gmail.com> wrote:

docker exec -it rabbitmq rabbitmqctl cluster_status