Hello,

Yesterday I sent a desperate message about my OpenStack deployment (kolla-ansible, Yoga, CentOS 8 Stream) not working. The situation is hard to understand: for some accounts in some projects everything works (creating networks, subnets, DHCP configuration, instances, all of it), while for other accounts everything works except getting an IP address via DHCP on a subnet. The servers were restarted after a power outage, and I managed to put everything back together except these DHCP issues.

Since I can start from an empty configuration, how do I reset everything, for example with a command like: kolla-ansible -i multinode ____________

Or how do I fix my current configuration? It seems very complicated to me. I would appreciate some opinions if possible.

Thanks in advance
Franck
On Thu, Nov 10, 2022 at 11:52 AM Franck VEDEL <franck.vedel@univ-grenoble-alpes.fr> wrote:
Are you asking how to completely zero out your entire cluster and rebuild it? That seems a bit drastic.

kolla-ansible destroy will nuke everything. Take a backup of /etc/kolla (or wherever your inventory / globals.yml / passwords.yml is) first. Older versions removed some things there when running destroy and I can't recall when / if that changed.

How many controllers do you have?
Are you using OVS, OVN, or something else?
Are you using L3-HA? DVR?
Did all nodes have to be rebooted? If not, then which ones?
Have you confirmed there are no dead containers on any controllers? (docker ps -a)
Have you looked in logs for ERROR messages? In particular: neutron-server.log, neutron-dhcp-agent.log, nova-api.log, and nova-compute.log?

Strange things happen when time is out of sync. Verify all the nodes synced properly to an NTP server. A big symptom of this is 'openstack hypervisor list' showing hosts going up and down every few seconds.
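For reference, a quick way to run those checks on each node (just a sketch; the log paths are the kolla-ansible defaults, adjust if yours differ):

# time sync status (chrony is the usual CentOS/kolla setup)
chronyc tracking
# any exited or dead containers?
docker ps -a --filter status=exited
# scan the Neutron and Nova logs for ERROR lines
grep ERROR /var/log/kolla/neutron/neutron-server.log /var/log/kolla/neutron/neutron-dhcp-agent.log
grep ERROR /var/log/kolla/nova/nova-api.log /var/log/kolla/nova/nova-compute.log
# run this a few times and watch for hosts flapping up/down
openstack hypervisor list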
Thanks for your help, really.

My cluster: 2 controller nodes, OVS, L3-HA. All nodes had to be rebooted. Everything works, for example, with external networks (so DHCP on external networks). There are no dead containers; all seems OK.

I try to create a new instance on an L3 network: no ERROR in neutron*.log. The only error is in nova-api.log. Example:

2022-11-11 08:45:54.452 42 ERROR oslo.messaging._drivers.impl_rabbit [-] [8b6fd776-f096-4c8a-927e-88225a3adb43] AMQP server on 10.0.5.109:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>

But on the first node (10.0.5.109 on the internal network), « netstat -atnp | wc -l » ———>>> 505 connections.

So… if I back up /etc/kolla, my Glance images, my configuration files… if I do « kolla-ansible destroy », is the next step « kolla-ansible bootstrap… », then prechecks, then deploy, or directly deploy? What's the difference with cleanup-containers?

I use this OpenStack cluster for my students, and I have a month to get it working again. I could reinstall everything (and change the operating system) but I don't have time for that. So I can lose all the user data; if I have my Glance images, my flavors, the configuration to hook up LDAP, and the certificates, I think it will be OK.

Franck VEDEL
On Fri, Nov 11, 2022 at 3:05 AM Franck VEDEL <franck.vedel@univ-grenoble-alpes.fr> wrote:
Sounds to me like Rabbit is broken. This could also be an issue with NTP, which I asked about earlier. Did you confirm your systems are all correctly synced to the same time source?

You can check the status of Rabbit on each control node with:

docker exec -it rabbitmq rabbitmqctl cluster_status

Output should show the same on both of your controllers. If not, restart your rabbit containers. If they won't come back properly, you could destroy and redeploy just those two containers.

On both controllers do:

docker rm rabbitmq
docker volume rm rabbitmq

Then:

kolla-ansible --tags rabbitmq deploy
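Pulling that together, the whole reset as one sequence (a sketch; it assumes the default kolla container/volume names and the multinode inventory used elsewhere in this thread):

# on each controller: remove the RabbitMQ container and its data volume
docker stop rabbitmq
docker rm rabbitmq
docker volume rm rabbitmq
# then, from the deployment host, redeploy only the rabbitmq service
kolla-ansible -i multinode --tags rabbitmq deploy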
You don't need to bootstrap again. That just installs prerequisites, which won't get removed by the destroy. Just go right to doing kolla-ansible deploy again. Do remember this will give you a brand-new OpenStack with nothing preserved from before.

cleanup-containers alone may leave behind some Docker tweaks that Neutron needs. It probably doesn't matter if you're going to just redeploy the same configuration though, so go ahead and use that instead.

-Erik
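Put together, the "start over" path being discussed would look roughly like this (a sketch only; the backup file names and the image name are placeholders, and destroy is irreversible):

# from the deployment host: save the kolla configuration first
tar czf kolla-config-backup.tar.gz /etc/kolla
# export anything worth keeping from the API side, e.g. a Glance image and the flavor list
openstack image save --file my-image-backup.qcow2 <image-name-or-id>
openstack flavor list --long -f value > flavors-backup.txt
# wipe the deployment (requires the confirmation flag)
kolla-ansible -i multinode destroy --yes-i-really-really-mean-it
# redeploy from the same /etc/kolla configuration
kolla-ansible -i multinode deploy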
Thanks for your help Erik. All is fine with NTP. Exactly the same result with "docker exec -it rabbitmq rabbitmqctl cluster_status" on the 2 nodes.

I will try this:
On both controllers do:

docker rm rabbitmq
docker volume rm rabbitmq
Then kolla-ansible --tags rabbitmq deploy
Franck VEDEL
Dép. Réseaux Informatiques & Télécoms
IUT1 - Univ GRENOBLE Alpes
0476824462
Stages, Alternance, Emploi.
Hi Franck!

Can you share the output of docker exec -it rabbitmq rabbitmqctl cluster_status?

Can you "nc -v" from one of the compute nodes towards the controller nodes?

Laurent
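A quick way to run that check from a compute node (a sketch; 10.0.5.109 and 10.0.5.110 are the controller internal addresses mentioned in this thread, and 5672/25672 are the AMQP and clustering ports):

# can the compute node reach RabbitMQ on both controllers?
nc -vz 10.0.5.109 5672
nc -vz 10.0.5.110 5672
# and the inter-node clustering port
nc -vz 10.0.5.109 25672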
Hello!

Output of the command:

Cluster status of node rabbit@iut1r-srv-ops01-i01 ...
Basics
Cluster name: rabbit@iut1r-srv-ops01-i01.u-ga.fr
Disk Nodes
rabbit@iut1r-srv-ops01-i01
rabbit@iut1r-srv-ops02-i01
Running Nodes
rabbit@iut1r-srv-ops01-i01
rabbit@iut1r-srv-ops02-i01
Versions
rabbit@iut1r-srv-ops01-i01: RabbitMQ 3.9.20 on Erlang 24.3.4.2
rabbit@iut1r-srv-ops02-i01: RabbitMQ 3.9.20 on Erlang 24.3.4.2
Maintenance status
Node: rabbit@iut1r-srv-ops01-i01, status: not under maintenance
Node: rabbit@iut1r-srv-ops02-i01, status: not under maintenance
Alarms
(none)
Network Partitions
(none)
Listeners
Node: rabbit@iut1r-srv-ops01-i01, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@iut1r-srv-ops01-i01, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@iut1r-srv-ops01-i01, interface: 10.0.5.109, port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@iut1r-srv-ops01-i01, interface: 10.0.5.109, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@iut1r-srv-ops02-i01, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@iut1r-srv-ops02-i01, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@iut1r-srv-ops02-i01, interface: 10.0.5.110, port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@iut1r-srv-ops02-i01, interface: 10.0.5.110, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Feature flags
Flag: drop_unroutable_metric, state: enabled
Flag: empty_basic_get_metric, state: enabled
Flag: implicit_default_bindings, state: enabled
Flag: maintenance_mode_status, state: enabled
Flag: quorum_queue, state: enabled
Flag: stream_queue, state: enabled
Flag: user_limits, state: enabled
Flag: virtual_host_metadata, state: enabled

So… nothing strange for me. All containers are healthy now (after deleting and rebuilding rabbitmq).

In addition to DHCP, communications on the network do not work. If I create an instance, it gets no IP address via DHCP. If I give it a static IP, it can't reach the router. If I create another instance with another static IP, they don't communicate with each other. And they can't ping the router (or routers; I put 2, one on each of my 2 external networks).

There are some errors in rabbitmq…..log:

2022-11-12 08:53:37.155542+01:00 [error] <0.16179.2> missed heartbeats from client, timeout: 60s
2022-11-12 08:54:54.026480+01:00 [error] <0.17357.2> closing AMQP connection <0.17357.2> (10.0.5.109:37532 -> 10.0.5.109:5672 - mod_wsgi:43:e50d8e69-7c76-4198-877c-c807e0a180d8):
2022-11-12 08:54:54.026480+01:00 [error] <0.17357.2> missed heartbeats from client, timeout: 60s

There are some errors also in neutron-l3-agent.log:

2022-11-11 22:04:42.512 37 ERROR oslo_service.periodic_task message = self.waiters.get(msg_id, timeout=timeout)
2022-11-11 22:04:42.512 37 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 445, in get
2022-11-11 22:04:42.512 37 ERROR oslo_service.periodic_task 'to message ID %s' % msg_id)
2022-11-11 22:04:42.512 37 ERROR oslo_service.periodic_task oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 297cacfadd764562bf09a1c5daf61958

Also in neutron-dhcp-agent.log:

2022-11-11 22:04:44.854 7 ERROR neutron.agent.dhcp.agent message = self.waiters.get(msg_id, timeout=timeout)
2022-11-11 22:04:44.854 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 445, in get
2022-11-11 22:04:44.854 7 ERROR neutron.agent.dhcp.agent 'to message ID %s' % msg_id)
2022-11-11 22:04:44.854 7 ERROR neutron.agent.dhcp.agent oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 6f1d9d0c51ac4d89b9c889ca273f40a0

A lot of errors in neutron-metadata.log:

2022-11-11 22:01:44.152 43 ERROR oslo.messaging._drivers.impl_rabbit [-] [d7902e2c-eba9-40e4-b872-40e7ba7a39ec] AMQP server on 10.0.5.109:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-11-11 22:01:44.226 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [028872e4-fcd1-4de5-b20c-8c5541e3c77f] AMQP server on 10.0.5.109:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>

Timeout… waiting… unreachable… connection error…

Something is wrong, but I think it's very difficult to find the problem. Too difficult for me. "nc -v" works.

I do not know what to do. I can lose all data (networks, instances, volumes, etc). I can start again on a new config. Do I do it with kolla-ansible -i multinode destroy?

Before switching to Yoga, I had a cluster under Xena. I kept my configuration and a venv (python) with kolla-ansible for Xena. Should I go back to this version? How, without doing stupid things?

Thanks a lot.
Franck VEDEL
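If it helps to dig a little further before wiping anything, here are a couple of read-only checks against the running broker (a sketch; run them on a controller, in the same container as above):

# which clients currently hold AMQP connections, and in what state
docker exec -it rabbitmq rabbitmqctl list_connections name peer_host state
# queue depth and consumer counts on the default vhost
docker exec -it rabbitmq rabbitmqctl list_queues -p / name messages consumers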
On Sat, Nov 12, 2022 at 3:08 AM Franck VEDEL <franck.vedel@univ-grenoble-alpes.fr> wrote:
There are several things that can cause issues with Rabbit, or with services sending messages. Rabbit itself is not always to blame. Things I've seen cause issues before include:

1) Time not being in sync on all systems (covered that earlier)
2) DNS (it's always DNS, right?)
3) Networking issues like mismatched MTU
4) Nova being configured for a Ceph backend, but timing out trying to talk to the cluster (messages would expire while Nova waited on it)
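A few quick checks for those, as a rough sketch (the 10.0.5.110 address, the node name and the 9000-byte MTU are taken from this thread; adjust them to your own setup):

# 1) time sync
chronyc tracking
# 2) name resolution of the other nodes
getent hosts iut1r-srv-ops02-i01
# 3) MTU: send a non-fragmentable jumbo frame between nodes
#    (8972 = 9000 minus 20 bytes IP header and 8 bytes ICMP header)
ping -M do -s 8972 -c 3 10.0.5.110
# 4) only if Nova/Glance/Cinder use Ceph: make sure the cluster answers promptly
ceph -s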
Yeah, just do kolla-ansible -i multinode destroy after backing up your kolla configs.
I can't see any good reason to roll back to Xena. Yoga should be fine.
Changing should be as simple as swapping your VENV, and using your Xena globals.yml, passwords.yml, inventory, and any other custom configs you had for that version.
3) Networking issues like mismatched MTU
My MTU (between nodes) is 9000… I believe my problem is the MTU.

I modified /etc/kolla/config/neutron.conf and /etc/kolla/config/neutron/ml2_conf.ini.conf, then ran kolla-ansible -i multinode reconfigure (case 1 here: https://docs.openstack.org/newton/networking-guide/config-mtu.html).

I tested everything again: functions that did not work now work again, but not all of them. For example, instances get an IP through DHCP but can't ping the router, yet on some networks it works. However, before the reboot of the servers, I had not had a problem with the MTU of 9000.

I'm going back to a 1500 MTU on Monday, on site.

Thank you Erik!!!

Franck VEDEL
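For anyone hitting the same thing, "case 1" of that guide boils down to two settings. A minimal sketch of the kolla override files (the 9000 value is this cluster's jumbo-frame MTU, and the filenames assume the usual /etc/kolla/config layout):

# tell Neutron the underlying physical network MTU
cat >> /etc/kolla/config/neutron.conf <<'EOF'
[DEFAULT]
global_physnet_mtu = 9000
EOF

# and the largest path MTU the ML2 plugin may assume for tenant networks
cat >> /etc/kolla/config/neutron/ml2_conf.ini <<'EOF'
[ml2]
path_mtu = 9000
EOF

# push the change out
kolla-ansible -i multinode reconfigure --tags neutron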
Hello. Thanks a lot Erik, my problem was the MTU. If I go back to a situation with MTU=1500 everywhere, all is working fine!!!

Is the following configuration possible, and if so, how do I configure it with the kolla-ansible files?

3 networks:
- external (2 externals, VLAN 10 and VLAN 20): MTU = 1500
- admin: MTU = 1500
- management: MTU = 9000 (a SCSI bay stores the volumes; MTU 9000 works fine there)

Thanks a lot if you have a solution for this. If it's impossible, I'll stay with 1500… it's working, no problem.

Franck
On Mon, Nov 14, 2022 at 8:26 AM Franck VEDEL <franck.vedel@univ-grenoble-alpes.fr> wrote:
It is possible, but in some ways not advisable. Just from a general networking standpoint, I wouldn't set any interface used for traffic coming to / from the internet to use jumbo frames. Strange things happen when you start fragmenting to fit through standard internet routers, particularly when you run into something on the other end that is also using a large MTU. It's fine for internal management networks, storage networks, and the like. You could move your tenant / tunneling VLAN over to a different interface and let that other one serve your internal needs.

That being said, you need to account for VXLAN encapsulation overhead in your MTU considerations. Whatever your physical interface config is set to, your tenant networks need to use 50 bytes less. I think this is fine by default when using 1500, but it can get weird when using jumbo frames.

If you put an override config file in /etc/kolla/config/neutron/ml2_ini.conf with something like:

[ml2]
path_mtu = 9000

it should tell Neutron to take that into account.

-Erik
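And for the split Franck described (jumbo frames only on the storage/management side, 1500 on the externals), the per-physnet knob is physical_network_mtus. A sketch only: the physnet names below are placeholders for whatever is defined in your bridge mappings, and the conventional kolla override filename is assumed:

cat >> /etc/kolla/config/neutron/ml2_conf.ini <<'EOF'
[ml2]
# underlying path MTU for tunnelled (VXLAN) tenant networks; Neutron subtracts the 50-byte overhead
path_mtu = 9000
# keep provider networks on the external physnets at 1500 (placeholder physnet names)
physical_network_mtus = physnet-ext1:1500,physnet-ext2:1500
EOF
kolla-ansible -i multinode reconfigure --tags neutron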
participants (3)
- Erik McCormick
- Franck VEDEL
- Laurent Dumont