Thanks for this help.
I also have 2 control nodes.
But since my first message, I have new problems. Some new instances (for some accounts, on some L3 networks) do not receive their addresses through DHCP. I even see these problems with the admin@default account.
The same happens with my test account, and when I ask a user to try, it works on some networks but not on others.
On the other hand, everything works for everyone on external networks (including DHCP).

Well, this all seems impossible to fix. I'm not sure I can do what you suggest because I don't know the startup order of the services. And there are so many logs; I look through them carefully, but I can't find the exact problem.

I plan to reinstall, because this OpenStack is used by students every year, and I have a month to get it working. It's just demoralizing to have to redo everything when the system was working so well. On the other hand, it would be an opportunity to correct the mistakes I made (in particular, switching from CentOS Stream to Ubuntu for the servers and containers).

thanks again

Franck 


On 10 Nov 2022, at 19:09, Eugen Block <eblock@nde.ag> wrote:

Hi,

this sounds very similar to something I experienced a couple of times this year. In an HA cloud with two control nodes (the third joined just recently), when one node was (accidentally) shut down, I saw basically the same effects you're describing. I could create new networks, and new instances started successfully and got their IPs via DHCP, while existing VMs didn't work properly (at least the DHCP part on self-service networks). I'm still not sure what exactly the root cause is, as I can't reproduce it in my test lab, and retrying it in a production cluster is not a good idea. ;-)
I got things to work, but it's still unclear what exactly it was. You might see hints in the neutron logs that something's not right; I don't recall the exact message, but it was something like "dhcp agent doesn't work because the server is overloaded". By the way, what is the number of dhcp agents per network you have in neutron.conf?
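In case it helps, here is a quick way to check that setting and the agent state. This is only a sketch: I'm assuming the standard neutron.conf option name (dhcp_agents_per_network) and the stock openstack CLI; paths may differ in your deployment.

```shell
# Check how many DHCP agents serve each network (on a control node)
grep dhcp_agents_per_network /etc/neutron/neutron.conf

# List the DHCP agents and whether neutron considers them alive
openstack network agent list --agent-type dhcp

# List the agents currently hosting one specific network
# (NET_ID is a placeholder for your network's ID)
openstack network agent list --network NET_ID
```

If an agent shows as alive but instances on its networks still get no leases, the agent's own log is the next place to look.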
Briefly, here's what I did (at that time with 2 control nodes):
- put the pacemaker cluster into maintenance mode so I could stop and start services manually
- stopped all services except rabbitmq and galera
- made sure all services (like neutron) were actually "dead", so no left over processes
- started apache and haproxy on one node only so all requests would land there
- started one service after another manually and watched the logs
- now the dhcp agent started successfully and logged normally
- started the services on the remaining control node and everything was stable
- the cluster then recovered
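The steps above could look roughly like this on a pacemaker cluster. This is a sketch only: I'm assuming the pcs tooling and Ubuntu-style systemd unit names (neutron-server, neutron-dhcp-agent, apache2), which differ between distributions and deployment tools.

```shell
# Put the pacemaker cluster into maintenance mode so services
# can be stopped and started manually
pcs property set maintenance-mode=true

# Stop everything except rabbitmq and galera
# (unit names vary by distro/deployment)
systemctl stop neutron-server neutron-dhcp-agent neutron-l3-agent

# Make sure the services are actually dead, no leftover processes
pgrep -af neutron   # should print nothing

# Start apache and haproxy on ONE node only so all requests land there
systemctl start apache2 haproxy

# Start one service after another and watch its log before continuing
systemctl start neutron-server
journalctl -u neutron-server -f

systemctl start neutron-dhcp-agent
journalctl -u neutron-dhcp-agent -f

# Once everything is stable on both control nodes, end maintenance mode
pcs property set maintenance-mode=false
```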

I don't know if that helps in any way, but I thought I'd share. By the way, we don't use kolla so I can't really comment that part.

Regards,
Eugen

Quoting Franck VEDEL <franck.vedel@univ-grenoble-alpes.fr>:

Hello,
after a restart of my cluster (and some problems...), I have one last problem with the VMs already present (before the restart).
They all work: console access OK, network topology OK…

But they can no longer communicate on the network; they do not obtain IP addresses via DHCP. Yet everything seems to be working.
If I detach the interface and create a new one, it doesn't work. I cannot reach the routers. I cannot communicate with an instance on the same network.
On the other hand, if I create a new instance, there is no problem: it works and can reach the other instances and its router.
Is there a way to fix this? Where is the problem? In the database?
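(For anyone hitting the same symptom, one first check could be whether the DHCP namespaces and their dnsmasq processes still exist on the node running the DHCP agent. A sketch, assuming the usual qdhcp namespace naming; NET_ID and the router IP are placeholders.)

```shell
# Each self-service network should have a qdhcp-<network-id> namespace
ip netns list | grep qdhcp

# A dnsmasq process should be running for each network
pgrep -af dnsmasq

# Inspect the namespace and test reachability from inside it
sudo ip netns exec qdhcp-NET_ID ip addr
sudo ip netns exec qdhcp-NET_ID ping -c 1 10.0.0.1   # example router IP
```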
Thank you in advance for your help.

Franck VEDEL