I’ve noticed that the Nova services are behaving similarly (some are reported as down but actually aren’t). Is the restart procedure the same as for Neutron, i.e. nova-api would be the last to stop and the first to start? Also, checking *rabbitmqctl cluster_status*, the cluster looks OK.
On Wed, Mar 19, 2025 at 6:39 PM Eugen Block <eblock@nde.ag> wrote:
I would probably start with restarting the neutron agents. I’ve come across comparable situations a couple of times, or at least it looks similar. Usually I stop all agents, neutron-server last, then start neutron-server first, followed by the other agents. But if you had a wider outage, restarting rabbitmq might be necessary as well. Check rabbitmqctl cluster_status and the rabbitmq logs to determine if it is okay.
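With kolla-ansible, that stop/start order would look roughly like the sketch below. The container names are the usual kolla defaults, so adjust them to whatever your deployment uses:

  # stop the agents first, neutron-server last (run on the relevant nodes)
  docker stop neutron_l3_agent neutron_dhcp_agent neutron_metadata_agent neutron_openvswitch_agent
  docker stop neutron_server

  # start neutron-server first, then the agents
  docker start neutron_server
  docker start neutron_openvswitch_agent neutron_l3_agent neutron_dhcp_agent neutron_metadata_agent

  # sanity-check rabbitmq before and after
  docker exec rabbitmq rabbitmqctl cluster_status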
Quoting Winicius Allan <winiciusab12@gmail.com>:
Digging into the logs, I found entries like
2025-03-19 20:43:37.834 26 WARNING neutron.plugins.ml2.drivers.mech_agent [req-f4e98255-be23-4d7a-9d4a-c680b1320bdf req-c4a562cf-817f-4d7d-be1c-907e47c4c940 aa8b08700bdd4acc99a8e4a33180f764 e8866ffd910d4a08bdb347aedd80cdf1 - - default default] Refusing to bind port 6961fd23-9edd-4dc7-99ad-88213965c796 to dead agent: {'id': 'ab9612a8-0a92-4545-a088-9ea0dd1e527b', 'agent_type': 'Open vSwitch agent', 'binary': 'neutron-openvswitch-agent', 'topic': 'N/A', 'host': xx, 'admin_state_up': True, 'created_at': datetime.datetime(2024, 6, 12, 17, 4, 25), 'started_at': datetime.datetime(2025, 3, 15, 21, 1, 4), 'heartbeat_timestamp': datetime.datetime(2025, 3, 19, 20, 41, 53), 'description': None, 'resources_synced': None, 'availability_zone': None, 'alive': False,
ERROR neutron.plugins.ml2.managers [req-f4e98255-be23-4d7a-9d4a-c680b1320bdf req-c4a562cf-817f-4d7d-be1c-907e47c4c940 aa8b08700bdd4acc99a8e4a33180f764 e8866ffd910d4a08bdb347aedd80cdf1 - - default default] Failed to bind port 6961fd23-9edd-4dc7-99ad-88213965c796 on host lsd-srv-115 for vnic_type normal using segments
and they led me here[1]. I think it is a RabbitMQ problem: Neutron sees the OVS agent as not alive, even though the container status is "healthy".
When I run "openstack network agent list", the output is inconsistent: one run shows some agents as not alive, and the next run shows a different set of agents as alive/not alive. Is a rolling restart of RabbitMQ the way to go? Has anyone faced this problem before?
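For reference, this is roughly how I have been watching the flapping and checking whether the agent report queues still have consumers (assuming the default kolla rabbitmq container name and the usual neutron queue names):

  # re-run the agent list every few seconds and watch the Alive column flap
  watch -n 5 'openstack network agent list -c Host -c Binary -c Alive'

  # look for neutron RPC queues piling up messages without consumers
  docker exec rabbitmq rabbitmqctl list_queues name messages consumers | grep -E 'q-plugin|q-reports|q-agent'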
[1]
https://github.com/openstack/neutron/blob/unmaintained/zed/neutron/plugins/m...
Regards.
On Wed, Mar 19, 2025 at 10:08 AM Winicius Allan <winiciusab12@gmail.com> wrote:
Hello stackers!
release: zed
deploy-tool: kolla-ansible
After an outage, all the load balancers in my cluster went to provisioning status ERROR because the o-hm0 interface was unavailable on the controller nodes. I recreated the interfaces and tried a failover on the load balancers. The octavia-worker logs show that the failover completed successfully, but the provisioning status remains ERROR. Looking into the nova-compute logs, I see that an exception was raised:
os_vif.exception.NetworkInterfaceNotFound: Network interface qvob8ac0f5f-46 not found
The instance ID for that interface matches the new amphora ID. I'll attach the nova-compute logs here[1]. In the neutron-server logs there is no ERROR or WARNING that I can find; the only suspicious entry is
Port b8ac0f5f-4613-4b95-9690-7cf59f739fd0 cannot update to ACTIVE because it is not bound. _port_provisioned
/var/lib/kolla/venv/lib/python3.10/site-packages/neutron/plugins/ml2/plugin.py:360
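For reference, this is roughly how I have been checking whether that port ever got bound (port ID taken from the log entry above; a failed or missing binding should show up in the binding fields):

  openstack port show b8ac0f5f-4613-4b95-9690-7cf59f739fd0 \
    -c status -c binding_host_id -c binding_vif_type
  # binding_vif_type of "binding_failed" or "unbound" means neutron never
  # completed the bind for this port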
Can anyone shed some light on this?