I’ve noticed that the Nova services are behaving similarly (some are reported as down but actually aren’t). Is the restart procedure the same as for Neutron, i.e. nova-api would be the last to stop and the first to start? Also, checking *rabbitmqctl cluster_status*, the cluster looks OK.
On Wed, Mar 19, 2025 at 6:39 PM Eugen Block <eblock@nde.ag> wrote:
I would probably start with restarting the neutron agents. I’ve come across comparable situations a couple of times, or at least it looks similar. Usually I stop all agents, neutron-server last, then start neutron-server first, followed by the other agents. But if you had a wider outage, restarting rabbitmq might be necessary as well. Check rabbitmqctl cluster_status and the rabbitmq logs to determine if it is okay.
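With kolla-ansible, that stop/start order would look roughly like the sketch below. The container names are the usual kolla defaults, so adjust them to whatever your deployment uses:

  # stop the agents first, neutron-server last (run on the relevant nodes)
  docker stop neutron_l3_agent neutron_dhcp_agent neutron_metadata_agent neutron_openvswitch_agent
  docker stop neutron_server

  # start neutron-server first, then the agents
  docker start neutron_server
  docker start neutron_openvswitch_agent neutron_l3_agent neutron_dhcp_agent neutron_metadata_agent

  # sanity-check rabbitmq before and after
  docker exec rabbitmq rabbitmqctl cluster_status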
Quoting Winicius Allan <winiciusab12@gmail.com>:
Digging into the logs, I found entries like
2025-03-19 20:43:37.834 26 WARNING neutron.plugins.ml2.drivers.mech_agent [req-f4e98255-be23-4d7a-9d4a-c680b1320bdf req-c4a562cf-817f-4d7d-be1c-907e47c4c940 aa8b08700bdd4acc99a8e4a33180f764 e8866ffd910d4a08bdb347aedd80cdf1 - - default default] Refusing to bind port 6961fd23-9edd-4dc7-99ad-88213965c796 to dead agent: {'id': 'ab9612a8-0a92-4545-a088-9ea0dd1e527b', 'agent_type': 'Open vSwitch agent', 'binary': 'neutron-openvswitch-agent', 'topic': 'N/A', 'host': xx, 'admin_state_up': True, 'created_at': datetime.datetime(2024, 6, 12, 17, 4, 25), 'started_at': datetime.datetime(2025, 3, 15, 21, 1, 4), 'heartbeat_timestamp': datetime.datetime(2025, 3, 19, 20, 41, 53), 'description': None, 'resources_synced': None, 'availability_zone': None, 'alive': False,
ERROR neutron.plugins.ml2.managers [req-f4e98255-be23-4d7a-9d4a-c680b1320bdf req-c4a562cf-817f-4d7d-be1c-907e47c4c940 aa8b08700bdd4acc99a8e4a33180f764 e8866ffd910d4a08bdb347aedd80cdf1 - - default default] Failed to bind port 6961fd23-9edd-4dc7-99ad-88213965c796 on host lsd-srv-115 for vnic_type normal using segments
and they led me here[1]. I think it is a RabbitMQ problem: Neutron sees the OVS agent as not alive, even though the container status is "healthy".
When I run "openstack network agent list", the output is inconsistent: one run shows some agents as not alive, and the next run shows a different set of agents as alive/not alive. Is a rolling restart of RabbitMQ the way to go? Has anyone faced this problem before?
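For reference, this is roughly how I have been watching the flapping and checking whether the agent report queues still have consumers (assuming the default kolla rabbitmq container name and the usual neutron queue names):

  # re-run the agent list every few seconds and watch the Alive column flap
  watch -n 5 'openstack network agent list -c Host -c Binary -c Alive'

  # look for neutron RPC queues piling up messages without consumers
  docker exec rabbitmq rabbitmqctl list_queues name messages consumers | grep -E 'q-plugin|q-reports|q-agent'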
[1]
https://github.com/openstack/neutron/blob/unmaintained/zed/neutron/plugins/m...
Regards.
On Wed, Mar 19, 2025 at 10:08 AM Winicius Allan <winiciusab12@gmail.com> wrote:
Hello stackers!
release: zed
deploy-tool: kolla-ansible
After an outage, all the load balancers in my cluster went to provisioning status ERROR because the o-hm0 interface was unavailable on the controller nodes. I recreated the interfaces and tried a failover on the load balancers. The octavia-worker logs show that the failover completed successfully, but the provisioning status remains ERROR. Looking into the nova-compute logs, I see that an exception was raised:
os_vif.exception.NetworkInterfaceNotFound: Network interface qvob8ac0f5f-46 not found
The instance ID for that interface matches the new amphora ID. I'll attach the nova-compute logs here[1]. In the neutron-server logs there is no ERROR or WARNING that I can find; the only suspicious entry is
Port b8ac0f5f-4613-4b95-9690-7cf59f739fd0 cannot update to ACTIVE because it is not bound. _port_provisioned
/var/lib/kolla/venv/lib/python3.10/site-packages/neutron/plugins/ml2/plugin.py:360
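For reference, this is roughly how I have been checking whether that port ever got bound (port ID taken from the log entry above; a failed or missing binding should show up in the binding fields):

  openstack port show b8ac0f5f-4613-4b95-9690-7cf59f739fd0 \
    -c status -c binding_host_id -c binding_vif_type
  # binding_vif_type of "binding_failed" or "unbound" means neutron never
  # completed the bind for this port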
Can anyone shed some light on this?