I’ve noticed that the Nova agents are behaving similarly (some agents are reported as down even though they aren't). Is the restart procedure the same as for Neutron? Maybe nova-api should be stopped last and started first?
Also, rabbitmqctl cluster_status shows the cluster is OK.
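To be explicit about the order I have in mind, a rough sketch assuming the
usual kolla-ansible container names (nova_api, nova_conductor and
nova_scheduler on the controllers, nova_compute on the hypervisors):

  # stop the agents first, nova-api last
  docker stop nova_compute                              # on each compute node
  docker stop nova_scheduler nova_conductor nova_api    # on each controller

  # start in reverse: nova-api first, then the agents
  docker start nova_api nova_conductor nova_scheduler   # on each controller
  docker start nova_compute                             # on each compute node
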
On Wed, Mar 19, 2025 at 18:39, Eugen Block <
eblock@nde.ag> wrote:
I would probably start with restarting the neutron agents. I’ve come
across comparable situations a couple of times, or at least it seems
that way. Usually I stop all agents, neutron-server last, then start
neutron-server first, followed by the other agents.
But if you had a wider outage, restarting rabbitmq might be necessary
as well. Check rabbitmqctl cluster_status and its logs to determine if
rabbitmq is okay.
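To make that order concrete, a minimal sketch assuming kolla-ansible
container names (neutron_server on the controllers, the agents wherever
they run):

  # stop the agents first, neutron-server last
  docker stop neutron_openvswitch_agent neutron_l3_agent \
              neutron_dhcp_agent neutron_metadata_agent
  docker stop neutron_server

  # start neutron-server first, then the agents
  docker start neutron_server
  docker start neutron_openvswitch_agent neutron_l3_agent \
               neutron_dhcp_agent neutron_metadata_agent

  # sanity-check rabbitmq itself
  docker exec rabbitmq rabbitmqctl cluster_status
  docker logs --tail 100 rabbitmq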
Quoting Winicius Allan <winiciusab12@gmail.com>:
> Digging into the logs, I found these entries:
>
> 2025-03-19 20:43:37.834 26 WARNING neutron.plugins.ml2.drivers.mech_agent
> [req-f4e98255-be23-4d7a-9d4a-c680b1320bdf
> req-c4a562cf-817f-4d7d-be1c-907e47c4c940 aa8b08700bdd4acc99a8e4a33180f764
> e8866ffd910d4a08bdb347aedd80cdf1 - - default default] Refusing to bind port
> 6961fd23-9edd-4dc7-99ad-88213965c796 to dead agent: {'id':
> 'ab9612a8-0a92-4545-a088-9ea0dd1e527b', 'agent_type': 'Open vSwitch agent',
> 'binary': 'neutron-openvswitch-agent', 'topic': 'N/A', 'host': xx,
> 'admin_state_up': True, 'created_at': datetime.datetime(2024, 6, 12, 17, 4,
> 25), 'started_at': datetime.datetime(2025, 3, 15, 21, 1, 4),
> 'heartbeat_timestamp': datetime.datetime(2025, 3, 19, 20, 41, 53),
> 'description': None, 'resources_synced': None, 'availability_zone': None,
> 'alive': False,
>
> ERROR neutron.plugins.ml2.managers
> [req-f4e98255-be23-4d7a-9d4a-c680b1320bdf
> req-c4a562cf-817f-4d7d-be1c-907e47c4c940 aa8b08700bdd4acc99a8e4a33180f764
> e8866ffd910d4a08bdb347aedd80cdf1 - - default default] Failed to bind port
> 6961fd23-9edd-4dc7-99ad-88213965c796 on host lsd-srv-115 for vnic_type
> normal using segments
>
> and that led me here[1]. I think it is a problem with RabbitMQ, which
> makes Neutron see the OVS agent as not alive even though the container
> reports a "healthy" status.
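>
> For reference, agent liveness is derived from heartbeats that travel
> over RabbitMQ: neutron-server marks an agent dead when no report has
> arrived within agent_down_time. A sketch of the options involved (the
> values shown are, as far as I know, the defaults):
>
>   # neutron.conf on the controllers
>   [DEFAULT]
>   agent_down_time = 75
>
>   # agent-side configuration, e.g. openvswitch_agent.ini
>   [agent]
>   report_interval = 30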
>
> When I run "openstack network agent list" the output is inconsistent:
> one run shows some agents as not alive, and the next run shows a
> different set of agents as alive/not alive. Is a rolling restart of
> RabbitMQ the way to go? Has anyone faced this problem before?
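>
> In case it helps to see what I mean by inconsistent, I am sampling the
> agent list roughly like this (plain shell, nothing deployment-specific):
>
>   # print the agent list every 10 seconds and watch the Alive column flap
>   while true; do
>     date
>     openstack network agent list -c ID -c Host -c Binary -c Alive -c State
>     sleep 10
>   done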
>
> [1]
> https://github.com/openstack/neutron/blob/unmaintained/zed/neutron/plugins/ml2/managers.py#L819
>
> Regards.
>
> On Wed, Mar 19, 2025 at 10:08, Winicius Allan <winiciusab12@gmail.com>
> wrote:
>
>> Hello stackers!
>>
>> release: zed
>> deploy-tool: kolla-ansible
>>
>> After an outage, all the load balancers in my cluster went to
>> provisioning status ERROR because the o-hm0 interface was unavailable on
>> the controller nodes. I recreated the interfaces and triggered a failover
>> on the load balancers. The octavia-worker logs show that the failover
>> completed successfully, but the provisioning status remains ERROR.
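>>
>> For completeness, this is roughly what I ran per load balancer
>> (<lb-id> is a placeholder):
>>
>>   # check the current state
>>   openstack loadbalancer show <lb-id> -c provisioning_status -c operating_status
>>
>>   # trigger a failover so a new amphora gets built
>>   openstack loadbalancer failover <lb-id>
>>
>>   # watch the amphorae that belong to it
>>   openstack loadbalancer amphora list --loadbalancer <lb-id>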
>> Looking into nova-compute logs, I see that an exception was raised
>>
>> os_vif.exception.NetworkInterfaceNotFound: Network interface
>> qvob8ac0f5f-46 not found
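>>
>> To double-check on the compute node whether that interface really is
>> missing (on kolla the ovs-vsctl call goes through the
>> openvswitch_vswitchd container):
>>
>>   ip link show qvob8ac0f5f-46
>>   docker exec openvswitch_vswitchd ovs-vsctl list-ports br-int | grep b8ac0f5f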
>>
>> The instance id for that interface matches the new amphora id. I'll
>> attach the nova-compute logs here[1]. In the neutron-server logs there is
>> no ERROR or WARNING that I can find; the only suspicious entry is
>>
>> Port b8ac0f5f-4613-4b95-9690-7cf59f739fd0 cannot update to ACTIVE because
>> it is not bound. _port_provisioned
>> /var/lib/kolla/venv/lib/python3.10/site-packages/neutron/plugins/ml2/plugin.py:360
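>>
>> In case it is useful, this is how I am inspecting the binding of that
>> port (column names as shown by the openstack CLI):
>>
>>   openstack port show b8ac0f5f-4613-4b95-9690-7cf59f739fd0 \
>>     -c status -c binding_vif_type -c binding_host_id -c device_owner
>>
>> If binding_vif_type comes back as binding_failed, the port never got
>> bound, which I think would explain the _port_provisioned message above.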
>>
>> Can anyone shed some light on this?
>>
>> [1] https://pastebin.com/hV3xgu0e
>>