Hi,
I haven't experienced a situation where nova had to be restarted in
a specific order, only neutron under heavy load. So I'd say you can
restart the nova services in any order. Regarding rabbitmq,
cluster_status is one thing, but did you inspect the logs as well? Do
you see any rabbit issues or just the neutron/nova agents
disconnecting and reconnecting?
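For the rabbitmq part, this is roughly what I'd look at. Just a
sketch for a kolla-ansible deployment; the container name "rabbitmq"
and the log paths are assumptions, adjust them to your environment:

  # cluster and queue health
  docker exec rabbitmq rabbitmqctl cluster_status
  docker exec rabbitmq rabbitmqctl list_queues name messages consumers

  # rabbit's own logs and the agents' view of the connection
  less /var/log/kolla/rabbitmq/rabbit@<hostname>.log
  grep -i amqp /var/log/kolla/neutron/neutron-openvswitch-agent.log

If rabbit itself looks clean and only the agents keep reconnecting,
I'd focus on the agents (and the load on the control plane) rather
than on the rabbitmq cluster.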
Years ago I helped a customer debug issues after an outage. I
believe it was a single control node at that time. IIRC, neutron had
triggered the OOM killer. So we restarted all kinds of services, but
nothing really worked. We were debugging for around 10 hours
straight, to no avail. We then decided to reboot the control node,
and the issues were gone. So we still didn't really know what had
been going on, but what I learned from that is that a reboot can be
much easier than trying to fix broken processes. ;-) At the same
time, I try to avoid reboots as much as possible if I see a chance
of fixing things without one. :-)
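By the way, to spell out the restart order from my previous mail
(quoted below) for a kolla-ansible deployment: the container names
below are assumptions, adjust them to the agents you actually run.

  # stop the agents first (on every node running them), neutron_server last
  docker stop neutron_openvswitch_agent neutron_dhcp_agent \
              neutron_l3_agent neutron_metadata_agent
  docker stop neutron_server

  # then start neutron_server first, the agents afterwards
  docker start neutron_server
  docker start neutron_openvswitch_agent neutron_dhcp_agent \
               neutron_l3_agent neutron_metadata_agent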
Quoting Winicius Allan <winiciusab12@gmail.com>:
> I've noticed that the nova agents are behaving similarly (some agents
> are reported as down but aren't). Is the process of restarting the
> agents the same as for neutron? Maybe nova-api is the last one to be
> stopped and the first to start.
>
> Also, checking *rabbitmqctl cluster_status*, the cluster is ok.
>
> On Wed, Mar 19, 2025 at 18:39, Eugen Block <eblock@nde.ag> wrote:
>
>> I would probably start with restarting the neutron agents. I've come
>> across comparable situations a couple of times, at least this seems
>> like a similar case. Usually I stop all agents, neutron-server last,
>> then start neutron-server first, then the other agents.
>> But if you had a wider outage, restarting rabbitmq might be necessary
>> as well. Check rabbitmqctl cluster_status and its logs to determine if
>> rabbitmq is okay.
>>
>> Quoting Winicius Allan <winiciusab12@gmail.com>:
>>
>> > Digging into the logs, I found these entries:
>> >
>> > 2025-03-19 20:43:37.834 26 WARNING
>> > neutron.plugins.ml2.drivers.mech_agent
>> > [req-f4e98255-be23-4d7a-9d4a-c680b1320bdf
>> > req-c4a562cf-817f-4d7d-be1c-907e47c4c940
>> > aa8b08700bdd4acc99a8e4a33180f764 e8866ffd910d4a08bdb347aedd80cdf1
>> > - - default default] Refusing to bind port
>> > 6961fd23-9edd-4dc7-99ad-88213965c796 to dead agent: {'id':
>> > 'ab9612a8-0a92-4545-a088-9ea0dd1e527b', 'agent_type': 'Open vSwitch
>> > agent', 'binary': 'neutron-openvswitch-agent', 'topic': 'N/A',
>> > 'host': xx, 'admin_state_up': True, 'created_at':
>> > datetime.datetime(2024, 6, 12, 17, 4, 25), 'started_at':
>> > datetime.datetime(2025, 3, 15, 21, 1, 4), 'heartbeat_timestamp':
>> > datetime.datetime(2025, 3, 19, 20, 41, 53), 'description': None,
>> > 'resources_synced': None, 'availability_zone': None, 'alive': False,
>> >
>> > ERROR neutron.plugins.ml2.managers
>> > [req-f4e98255-be23-4d7a-9d4a-c680b1320bdf
>> > req-c4a562cf-817f-4d7d-be1c-907e47c4c940 aa8b08700bdd4acc99a8e4a33180f764
>> > e8866ffd910d4a08bdb347aedd80cdf1 - - default default] Failed to bind port
>> > 6961fd23-9edd-4dc7-99ad-88213965c796 on host lsd-srv-115 for vnic_type
>> > normal using segments
>> >
>> > and I ended up here[1]. I think it is a problem with RabbitMQ, which
>> > is reporting to neutron that the OVS agent is not alive, although
>> > the container has a "healthy" status.
>> >
>> > When I run "openstack network agent list", the output is
>> > inconsistent: one time it shows that some agents are not alive, and
>> > the next time it shows different agents as alive/not alive. Is a
>> > rolling restart of RabbitMQ the way to go? Has anyone faced this
>> > problem before?
>> >
>> > [1]
>> > https://github.com/openstack/neutron/blob/unmaintained/zed/neutron/plugins/ml2/managers.py#L819
>> >
>> > Regards.
>> >
>> > On Wed, Mar 19, 2025 at 10:08, Winicius Allan
>> > <winiciusab12@gmail.com> wrote:
>> >
>> >> Hello stackers!
>> >>
>> >> release: zed
>> >> deploy-tool: kolla-ansible
>> >>
>> >> After an outage, all the load balancers in my cluster went to
>> >> provisioning status ERROR because the o-hm0 interface was
>> >> unavailable on the controller nodes. I created the interfaces
>> >> again and triggered a failover on the load balancers. The
>> >> octavia-worker logs show that the failover completed successfully,
>> >> but the provisioning status remains ERROR.
>> >> Looking into the nova-compute logs, I see that this exception was
>> >> raised:
>> >>
>> >> os_vif.exception.NetworkInterfaceNotFound: Network interface
>> >> qvob8ac0f5f-46 not found
>> >>
>> >> The instance id for that interface matches the new amphora id. I'll
>> >> attach the nova-compute logs here[1]. In the neutron-server logs,
>> >> there is no ERROR or WARNING that I can find; the only suspicious
>> >> entry is
>> >>
>> >> Port b8ac0f5f-4613-4b95-9690-7cf59f739fd0 cannot update to ACTIVE
>> >> because it is not bound. _port_provisioned
>> >> /var/lib/kolla/venv/lib/python3.10/site-packages/neutron/plugins/ml2/plugin.py:360
>> >>
>> >> Can anyone shed some light on this?
>> >>
>> >> [1] https://pastebin.com/hV3xgu0e
>> >>
>>